CONFIDENCE INTERVALS FOR HIGH-DIMENSIONAL
LINEAR REGRESSION: MINIMAX RATES AND 
ADAPTIVITY* 

By T. Tony Cai and Zijian Guo
University of Pennsylvania 

Confidence sets play a fundamental role in statistical inference.
In this paper, we consider confidence intervals for high-dimensional
linear regression with random design. We first establish the convergence
rates of the minimax expected length for confidence intervals
in the oracle setting where the sparsity parameter is given. The focus
is then on the problem of adaptation to sparsity for the construction
of confidence intervals. Ideally, an adaptive confidence interval
should have its length automatically adjusted to the sparsity of the
unknown regression vector, while maintaining a prespecified coverage
probability. It is shown that such a goal is in general not attainable,
except when the sparsity parameter is restricted to a small region
over which the confidence intervals have the optimal length of the
usual parametric rate. It is further demonstrated that the lack of
adaptivity is not due to the conservativeness of the minimax framework,
but is fundamentally caused by the difficulty of learning the
bias accurately.


1. Introduction. Driven by a wide range of applications, high-dimensional
linear regression, where the dimension $p$ can be much larger than the sample
size $n$, has received significant recent attention. The linear model is

(1.1)   $y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I),$

where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$ and $\beta \in \mathbb{R}^p$. Several
penalized/constrained minimization methods, including the Lasso [22],
Dantzig Selector [11], square-root Lasso [1], and scaled Lasso [21], have
been proposed and studied. Under regularity conditions on the design matrix
$X$, these methods with a suitable choice of the tuning parameter have been
shown to achieve the optimal rate of convergence $k \frac{\log p}{n}$ under the
squared error loss over the set of $k$-sparse regression coefficient vectors
with $k \le c \frac{n}{\log p}$, where $c > 0$ is a constant. That is, there exists
some constant $C > 0$ such that

(1.2)   $\sup_{\|\beta\|_0 \le k} P\left( \|\hat\beta - \beta\|_2^2 > C k \frac{\log p}{n} \right) = o(1),$

where $\|\beta\|_0$ denotes the number of nonzero coordinates of a vector
$\beta \in \mathbb{R}^p$. See, for example, [24, 2, 11, 21]. A key feature of the
estimation problem is that the optimal rate can be achieved adaptively with
respect to the sparsity parameter $k$.

Confidence sets play a fundamental role in statistical inference, and
confidence intervals for high-dimensional linear regression have been
actively studied recently with a focus on inference for individual
coordinates. But, compared to point estimation, there is still a paucity of
methods and fundamental theoretical results on confidence intervals for
high-dimensional regression. Zhang and Zhang [25] were the first to
introduce the idea of de-biasing for constructing a valid confidence
interval for a single coordinate $\beta_i$. The confidence interval is
centered at a low-dimensional projection estimator obtained through bias
correction via a score vector, using the scaled Lasso as the initial
estimator. [14, 15, 23] also used de-biasing for the construction of
confidence intervals, and [23] established asymptotic efficiency for the
proposed estimator. All the aforementioned papers [25, 14, 15, 23] have
focused on the ultra-sparse case where the sparsity $k \ll \frac{\sqrt{n}}{\log p}$ is
assumed. Under such a sparsity condition, the expected length of the
confidence intervals constructed in [25, 15, 23] is at the parametric rate
$\frac{1}{\sqrt{n}}$ and the procedures do not depend on the specific value of $k$.

Compared to point estimation, where the sparsity condition $k \ll \frac{n}{\log p}$
is sufficient for estimation consistency (see equation (1.2)), the condition
$k \ll \frac{\sqrt{n}}{\log p}$ for valid confidence intervals is much stronger. There
are several natural questions: What happens in the region where
$\frac{\sqrt{n}}{\log p} \lesssim k \lesssim \frac{n}{\log p}$? Is it still possible to construct a
valid confidence interval for $\beta_i$ in this case? Can one construct an
adaptive honest confidence interval not depending on $k$?

The goal of the present paper is to address these and other related
questions on confidence intervals for high-dimensional linear regression
with random design. More specifically, we consider construction of
confidence intervals for a linear functional $T(\beta) = \xi^T \beta$, where the
loading vector $\xi \in \mathbb{R}^p$ is given and
$\frac{\max_{i \in \mathrm{supp}(\xi)} |\xi_i|}{\min_{i \in \mathrm{supp}(\xi)} |\xi_i|} \le \bar{c}$ with $\bar{c} \ge 1$ being a
constant. Based on the sparsity of $\xi$, we focus on two specific regimes:
the sparse loading regime where $\|\xi\|_0 \le Ck$, with $C > 0$ being a
constant; the dense loading regime where $\|\xi\|_0$ satisfies (2.7) in
Section 2. It will be seen later that for confidence intervals
$T(\beta) = \beta_i$ is a prototypical case for the general functional
$T(\beta) = \xi^T \beta$ with a sparse loading $\xi$, and $T(\beta) = \sum_{i=1}^p \beta_i$ is a
representative case for $T(\beta) = \xi^T \beta$ with a dense loading $\xi$.

To illustrate the main idea, let us first focus on the two specific
functionals $T(\beta) = \beta_i$ and $T(\beta) = \sum_{i=1}^p \beta_i$. We establish the
convergence rate of the minimax expected length for confidence intervals in
the oracle setting where the sparsity parameter $k$ is given. It is shown
that in this case the minimax expected length is of order
$\frac{1}{\sqrt{n}} + k \frac{\log p}{n}$ for confidence intervals for $\beta_i$. An honest
confidence interval, which depends on the sparsity $k$, is constructed and
is shown to be minimax rate optimal. To the best of our knowledge, this is
the first construction of confidence intervals in the moderate-sparse
region $\frac{\sqrt{n}}{\log p} \ll k \lesssim \frac{n}{\log p}$. If the sparsity $k$ falls into the
ultra-sparse region $k \lesssim \frac{\sqrt{n}}{\log p}$, the constructed confidence interval
is similar to the confidence intervals constructed in [25, 15, 23]. On the
other hand, the convergence rate of the minimax expected length of honest
confidence intervals for $\sum_{i=1}^p \beta_i$ in the oracle setting is shown to be
$k \sqrt{\frac{\log p}{n}}$. A rate-optimal confidence interval that also depends on
$k$ is constructed. It should be noted that this confidence interval is not
based on the de-biased estimator.
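As a rough numerical illustration of the two oracle rates just described, the sketch below evaluates $\frac{1}{\sqrt{n}} + k \frac{\log p}{n}$ and $k \sqrt{\frac{\log p}{n}}$ with all constants set to 1. The function names are hypothetical and the constant factors are an assumption; the rates in the text are only stated up to constants.

```python
import math


def rate_single_coordinate(n: int, p: int, k: int) -> float:
    """Order of the minimax expected length for one coordinate of beta:
    1/sqrt(n) + k*log(p)/n (constants set to 1, an assumption)."""
    return 1.0 / math.sqrt(n) + k * math.log(p) / n


def rate_sum_of_coefficients(n: int, p: int, k: int) -> float:
    """Order of the minimax expected length for the sum of all coefficients:
    k*sqrt(log(p)/n) (constants set to 1, an assumption)."""
    return k * math.sqrt(math.log(p) / n)


if __name__ == "__main__":
    n, p = 10_000, 1_000
    for k in (1, 10, 100, 1_000):
        # The second rate grows linearly in k even when k is small,
        # while the first stays near 1/sqrt(n) until k*log(p) ~ sqrt(n).
        print(k, rate_single_coordinate(n, p, k), rate_sum_of_coefficients(n, p, k))
```

For small $k$ the first quantity is dominated by the parametric term $\frac{1}{\sqrt{n}}$, whereas the second has no parametric floor.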

One drawback of the constructed confidence intervals mentioned above is
that they require prior knowledge of the sparsity $k$. Such knowledge of
sparsity is usually unavailable in applications. A natural question is:
Without knowing the sparsity $k$, is it possible to construct a confidence
interval as good as when the sparsity $k$ is known? This is a question about
adaptive inference, which has been a major goal in nonparametric and
high-dimensional statistics. Ideally, an adaptive confidence interval should
have its length automatically adjusted to the true sparsity of the unknown
regression vector, while maintaining a prespecified coverage probability. We
show that, unlike point estimation, such a goal is in general not attainable
for confidence intervals. In the case of confidence intervals for $\beta_i$, it
is impossible to adapt between different sparsity levels, except when the
sparsity $k$ is restricted to the ultra-sparse region $k \lesssim \frac{\sqrt{n}}{\log p}$,
over which the confidence intervals have the optimal length of the
parametric rate $\frac{1}{\sqrt{n}}$, which does not depend on $k$. In the case of
confidence intervals for $\sum_{i=1}^p \beta_i$, it is shown that adaptation to the
sparsity is not possible at all, even in the ultra-sparse region
$k \lesssim \frac{\sqrt{n}}{\log p}$.

Minimax theory is often criticized as being too conservative as it focuses
on the worst case performance. For confidence intervals for high-dimensional
linear regression, we establish strong non-adaptivity results which
demonstrate that the lack of adaptivity is not due to the conservativeness
of the minimax framework. These results show that for any confidence
interval with guaranteed coverage probability over the set of $k$-sparse
vectors, its expected length at any given point in a large subset of the
parameter space must be at least of the same order as the minimax expected
length. So the confidence interval must be long at a large subset of points
in the parameter space, not just at a small number of "unlucky" points. This
leads directly to the impossibility of adaptation over different sparsity
levels. Fundamentally, the lack of adaptivity is caused by the difficulty in
accurately learning the bias of any estimator for high-dimensional linear
regression.

We now turn to confidence intervals for general linear functionals. For a
linear functional $\xi^T \beta$ in the sparse loading regime, the rate of the
minimax expected length is $\|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right)$, where
$\|\xi\|_2$ is the vector $\ell_2$ norm of $\xi$. For a linear functional $\xi^T \beta$
in the dense loading regime, the rate of the minimax expected length is
$\|\xi\|_\infty k \sqrt{\frac{\log p}{n}}$, where $\|\xi\|_\infty$ is the vector $\ell_\infty$ norm of
$\xi$. Regarding adaptivity, the phenomena observed in confidence intervals
for the two special linear functionals $T(\beta) = \beta_i$ and
$T(\beta) = \sum_{i=1}^p \beta_i$ extend to the general linear functionals. The case
of confidence intervals for $T(\beta) = \xi^T \beta$ with a sparse loading $\xi$ is
similar to that of confidence intervals for $\beta_i$ in the sense that
rate-optimal adaptation is impossible except when the sparsity $k$ is
restricted to the ultra-sparse region $k \lesssim \frac{\sqrt{n}}{\log p}$. On the other
hand, the case for a dense loading $\xi$ is similar to that of confidence
intervals for $\sum_{i=1}^p \beta_i$: adaptation to the sparsity $k$ is not possible
at all, even in the ultra-sparse region $k \lesssim \frac{\sqrt{n}}{\log p}$.

In addition to the more typical setting in practice where the covariance
matrix $\Sigma$ of the random design and the noise level $\sigma$ of the linear
model are unknown, we also consider the case with the prior knowledge of
$\Sigma = I$ and $\sigma = \sigma_0$. It turns out that this case is strikingly
different. The minimax rate for the expected length in the sparse loading
regime is reduced from $\|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right)$ to
$\frac{\|\xi\|_2}{\sqrt{n}}$, and in particular it does not depend on the sparsity $k$.
Furthermore, in marked contrast to the case of unknown $\Sigma$ and $\sigma$,
adaptation to sparsity is possible over the full range $k \lesssim \frac{n}{\log p}$. On
the other hand, for linear functionals $\xi^T \beta$ with a dense loading $\xi$,
the minimax rates and impossibility for adaptive confidence intervals do
not change even with the prior knowledge of $\Sigma = I$ and $\sigma = \sigma_0$.
However, the cost of adaptation is reduced with the prior knowledge.

The rest of the paper is organized as follows: After basic notation is
introduced, Section 2 presents a precise formulation for the adaptive
confidence interval problem. Section 3 establishes the minimaxity and
adaptivity results for a general linear functional $\xi^T \beta$ with a sparse
loading $\xi$. Section 4 focuses on confidence intervals for a general linear
functional $\xi^T \beta$ with a dense loading $\xi$. Section 5 considers the case
when there is prior knowledge of the covariance matrix of the random design
and the noise level of the linear model. Section 6 discusses connections to
other work and further research directions. The proofs of the main results
are given in Section 7. More discussion and proofs are presented in the
supplement [3].

2. Formulation for adaptive confidence interval problem. We present in
this section the framework for studying the adaptivity of confidence
intervals. We begin with the notation that will be used throughout the
paper.

2.1. Notation. For a matrix $X \in \mathbb{R}^{n \times p}$, $X_{i\cdot}$, $X_{\cdot j}$, and $X_{ij}$
denote respectively the $i$-th row, $j$-th column, and $(i, j)$ entry of the
matrix $X$, $X_{i,-j}$ denotes the $i$-th row of $X$ excluding the $j$-th
coordinate, and $X_{-j}$ denotes the submatrix of $X$ excluding the $j$-th
column. Let $[p] = \{1, 2, \ldots, p\}$. For a subset $J \subset [p]$, $X_J$ denotes
the submatrix of $X$ consisting of columns $X_{\cdot j}$ with $j \in J$, and for a
vector $x \in \mathbb{R}^p$, $x_J$ is the subvector of $x$ with indices in $J$ and
$x_{-J}$ is the subvector with indices in $J^c$. For a set $S$, $|S|$ denotes
the cardinality of $S$. For a vector $x \in \mathbb{R}^p$, $\mathrm{supp}(x)$ denotes the
support of $x$ and the $\ell_q$ norm of $x$ is defined as
$\|x\|_q = \left( \sum_{i=1}^p |x_i|^q \right)^{1/q}$ for $q \ge 0$, with $\|x\|_0 = |\mathrm{supp}(x)|$
and $\|x\|_\infty = \max_{1 \le j \le p} |x_j|$. We use $e_i$ to denote the $i$-th
standard basis vector in $\mathbb{R}^p$. For $a \in \mathbb{R}$, $a_+ = \max\{a, 0\}$. We use
$\max \|X_{\cdot j}\|_2$ as a shorthand for $\max_{1 \le j \le p} \|X_{\cdot j}\|_2$ and
$\min \|X_{\cdot j}\|_2$ as a shorthand for $\min_{1 \le j \le p} \|X_{\cdot j}\|_2$. For a matrix
$A$ and $1 \le q \le \infty$, $\|A\|_q = \sup_{\|x\|_q = 1} \|Ax\|_q$ is the matrix $\ell_q$
operator norm. In particular, $\|A\|_2$ is the spectral norm. For a symmetric
matrix $A$, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote respectively the smallest
and largest eigenvalue of $A$. We use $c$ and $C$ to denote generic positive
constants that may vary from place to place. For two positive sequences
$a_n$ and $b_n$, $a_n \lesssim b_n$ means $a_n \le C b_n$ for all $n$, $a_n \gtrsim b_n$ if
$b_n \lesssim a_n$, $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$, $a_n \ll b_n$ if
$\limsup_{n \to \infty} \frac{a_n}{b_n} = 0$, and $a_n \gg b_n$ if $b_n \ll a_n$.

2.2. Framework for adaptivity of confidence intervals. We shall focus in
this paper on the high-dimensional linear model with the Gaussian design,

(2.1)   $y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \epsilon_{n \times 1}, \quad \epsilon \sim N_n(0, \sigma^2 I),$

where the rows of $X$ satisfy $X_{i\cdot} \overset{\mathrm{iid}}{\sim} N_p(0, \Sigma)$, $i = 1, \ldots, n$, and
are independent of $\epsilon$. Both $\Sigma$ and the noise level $\sigma$ are unknown.
Let $\Omega = \Sigma^{-1}$ denote the precision matrix. The parameter
$\theta = (\beta, \Omega, \sigma)$ consists of the signal $\beta$, the precision matrix
$\Omega$ for the random design, and the noise level $\sigma$. The target of
interest is the linear functional of $\beta$, $T(\beta) = \xi^T \beta$, where
$\xi \in \mathbb{R}^p$ is a pre-specified loading vector. The data that we observe is
$Z = (Z_1, \ldots, Z_n)^T$, where $Z_i = (y_i, X_{i\cdot}) \in \mathbb{R}^{p+1}$ for $i = 1, \ldots, n$.
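To make the data-generating model (2.1) concrete, here is a minimal simulation sketch using only the Python standard library. It assumes the identity covariance $\Sigma = I$ and equal nonzero coefficients, which are simplifying choices made for this illustration and not part of the general framework; the function names are hypothetical.

```python
import random


def simulate_sparse_regression(n, p, k, sigma=1.0, signal=1.0, seed=0):
    """Draw (X, y, beta) from y = X beta + eps with a k-sparse beta.

    Assumptions of this sketch: rows of X are i.i.d. N_p(0, I), and every
    nonzero coefficient equals `signal`.
    """
    rng = random.Random(seed)
    support = sorted(rng.sample(range(p), k))   # indices of nonzero coefficients
    beta = [0.0] * p
    for j in support:
        beta[j] = signal
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(X[i][j] * beta[j] for j in support) + rng.gauss(0.0, sigma)
         for i in range(n)]
    return X, y, beta


def linear_functional(xi, beta):
    """T(beta) = xi^T beta for a loading vector xi."""
    return sum(a * b for a, b in zip(xi, beta))


if __name__ == "__main__":
    X, y, beta = simulate_sparse_regression(n=50, p=200, k=5)
    e1 = [1.0] + [0.0] * 199          # sparse loading: picks out beta_1
    ones = [1.0] * 200                # dense loading: sum of all coefficients
    print(linear_functional(e1, beta), linear_functional(ones, beta))
```

The two loadings in the usage example correspond to the prototypical functionals $\beta_1$ and $\sum_{i=1}^p \beta_i$ discussed in the introduction.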

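For $0 < \alpha < 1$ and a given parameter space $\Theta$ and the linear functional
$T(\beta)$, denote by $\mathcal{I}_\alpha(\Theta, T)$ the set of all $(1 - \alpha)$ level
confidence intervals for $T(\beta)$ over the parameter space $\Theta$,

(2.2)   $\mathcal{I}_\alpha(\Theta, T) = \left\{ \mathrm{CI}_\alpha(T, Z) = [l(Z), u(Z)] : \inf_{\theta \in \Theta} P_\theta\left( l(Z) \le T(\beta) \le u(Z) \right) \ge 1 - \alpha \right\}.$

For any confidence interval $\mathrm{CI}_\alpha(T, Z) \in \mathcal{I}_\alpha(\Theta, T)$, the maximum
expected length over a parameter space $\Theta$ is defined as

$L(\mathrm{CI}_\alpha(T, Z), \Theta, T) = \sup_{\theta \in \Theta} E_\theta L(\mathrm{CI}_\alpha(T, Z)),$

where for a confidence interval $\mathrm{CI}_\alpha(T, Z) = [l(Z), u(Z)]$,
$L(\mathrm{CI}_\alpha(T, Z)) = u(Z) - l(Z)$ denotes its length. For two parameter spaces
$\Theta_1 \subseteq \Theta$, we define the benchmark $L^*_\alpha(\Theta_1, \Theta, T)$ as the infimum
of the maximum expected length over $\Theta_1$ among all $(1 - \alpha)$-level
confidence intervals over $\Theta$,

(2.3)   $L^*_\alpha(\Theta_1, \Theta, T) = \inf_{\mathrm{CI}_\alpha(T, Z) \in \mathcal{I}_\alpha(\Theta, T)} L(\mathrm{CI}_\alpha(T, Z), \Theta_1, T).$

We will write $L^*_\alpha(\Theta, T)$ for $L^*_\alpha(\Theta, \Theta, T)$, which is the minimax
expected length of confidence intervals over $\Theta$.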

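We should emphasize that $L^*_\alpha(\Theta_1, \Theta, T)$ is an important quantity that
measures the degree of adaptivity over the nested spaces $\Theta_1 \subset \Theta$. A
confidence interval $\mathrm{CI}_\alpha(T, Z)$ that is (rate-optimally) adaptive over
$\Theta_1$ and $\Theta$ should have the optimal expected length performance
simultaneously over both $\Theta_1$ and $\Theta$ while maintaining a given coverage
probability over $\Theta$, i.e., $\mathrm{CI}_\alpha(T, Z) \in \mathcal{I}_\alpha(\Theta, T)$ such that

$L(\mathrm{CI}_\alpha(T, Z), \Theta_1, T) \asymp L^*_\alpha(\Theta_1, T)$  and  $L(\mathrm{CI}_\alpha(T, Z), \Theta, T) \asymp L^*_\alpha(\Theta, T).$

Note that in this case $L(\mathrm{CI}_\alpha(T, Z), \Theta_1, T) \ge L^*_\alpha(\Theta_1, \Theta, T)$. So
for two parameter spaces $\Theta_1 \subset \Theta$, if
$L^*_\alpha(\Theta_1, \Theta, T) \gg L^*_\alpha(\Theta_1, T)$, then rate-optimal adaptation between
$\Theta_1$ and $\Theta$ is impossible to achieve.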
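The two quantities in this framework, worst-case coverage over a parameter space and maximum expected length, can be approximated by Monte Carlo for any concrete interval. The sketch below does this for a deliberately simple stand-in problem: a normal mean with known variance and the textbook z-interval. This toy model, the finite grid of parameter values, and all function names are assumptions of the illustration; it is not the high-dimensional setting of the paper.

```python
import math
import random


def z_interval(sample, sigma, z=1.96):
    """Textbook interval for a normal mean with known sigma (toy example)."""
    n = len(sample)
    center = sum(sample) / n
    half = z * sigma / math.sqrt(n)
    return center - half, center + half


def coverage_and_length(theta, sigma, n, trials, seed=0):
    """Monte Carlo estimates of P_theta(l <= theta <= u) and E_theta(u - l)."""
    rng = random.Random(seed)
    hits, total_length = 0, 0.0
    for _ in range(trials):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma)
        hits += lo <= theta <= hi
        total_length += hi - lo
    return hits / trials, total_length / trials


def worst_case(thetas, sigma, n, trials):
    """Minimum coverage and maximum expected length over a finite grid,
    mimicking the inf in (2.2) and the sup in the maximum expected length."""
    results = [coverage_and_length(t, sigma, n, trials, seed=i)
               for i, t in enumerate(thetas)]
    return min(c for c, _ in results), max(l for _, l in results)


if __name__ == "__main__":
    print(worst_case([-1.0, 0.0, 2.0], sigma=1.0, n=25, trials=2000))
```

In this toy problem the interval length does not depend on the parameter, so there is nothing to adapt to; the adaptivity question arises precisely because the optimal length over $\Theta_1$ and over $\Theta$ can differ.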

We consider the following collection of parameter spaces,

(2.4)   $\Theta(k) = \left\{ \theta = (\beta, \Omega, \sigma) : \|\beta\|_0 \le k, \ \frac{1}{M_1} \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) \le M_1, \ 0 < \sigma \le M_2 \right\},$

where $M_1 \ge 1$ and $M_2 > 0$ are positive constants. Basically, $\Theta(k)$ is
the set of all $k$-sparse regression vectors;
$\frac{1}{M_1} \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) \le M_1$ and $0 < \sigma \le M_2$ are two mild
regularity conditions on the design and noise level.
The main goal of this paper is to address the following two questions:

1. What is the minimax length $L^*_\alpha(\Theta(k), T)$ in the oracle setting where
the sparsity level $k$ is known?

2. Is it possible to achieve rate-optimal adaptation over different sparsity
levels?

More specifically, for $k_1 \ll k$, is it possible to construct a confidence
interval $\mathrm{CI}_\alpha(T, Z)$ that is adaptive over $\Theta(k_1)$ and $\Theta(k)$ in the sense
that $\mathrm{CI}_\alpha(T, Z) \in \mathcal{I}_\alpha(\Theta(k), T)$ and

(2.5)   $L(\mathrm{CI}_\alpha(T, Z), \Theta(k_1), T) \asymp L^*_\alpha(\Theta(k_1), T),$
        $L(\mathrm{CI}_\alpha(T, Z), \Theta(k), T) \asymp L^*_\alpha(\Theta(k), T)?$

We will answer these questions by analyzing the two benchmark quantities
$L^*_\alpha(\Theta(k), T)$ and $L^*_\alpha(\Theta(k_1), \Theta(k), T)$. Both lower and upper bounds
will be established. If (2.5) can be achieved, it means that the confidence
interval $\mathrm{CI}_\alpha(T, Z)$ can automatically adjust its length to the sparsity
level of the true regression vector $\beta$. On the other hand, if
$L^*_\alpha(\Theta(k_1), \Theta(k), T) \gg L^*_\alpha(\Theta(k_1), T)$, then such a goal is not
attainable.

For ease of presentation, we calibrate the sparsity level

$k \asymp p^{\gamma}$  for some $0 \le \gamma < \frac{1}{2}$,

and restrict the loading $\xi$ to the set

$\xi \in \Xi(q, \bar{c}) = \left\{ \xi \in \mathbb{R}^p : \|\xi\|_0 = q, \ \xi \ne 0 \ \text{and} \ \frac{\max_{j \in \mathrm{supp}(\xi)} |\xi_j|}{\min_{j \in \mathrm{supp}(\xi)} |\xi_j|} \le \bar{c} \right\},$

where $\bar{c} \ge 1$ is a constant. The minimax rate and adaptivity of confidence
intervals for the general linear functional $\xi^T \beta$ also depend on the
sparsity of $\xi$. We are particularly interested in the following two regimes:

1. The sparse loading regime: $\xi \in \Xi(q, \bar{c})$ with

(2.6)   $q \le Ck.$

2. The dense loading regime: $\xi \in \Xi(q, \bar{c})$ with

(2.7)   $q = c p^{\gamma_q}$  with $2\gamma < \gamma_q \le 1$.

The behavior of the problem is significantly different in these two regimes.
We will consider separately the sparse loading regime in Section 3 and the
dense loading regime in Section 4.

3. Minimax rate and adaptivity of confidence intervals for sparse
loading linear functionals. In this section, we establish the rates of
convergence for the minimax expected length of confidence intervals for
$\xi^T \beta$ with a sparse loading $\xi$ in the oracle setting where the sparsity
parameter $k$ of the regression vector $\beta$ is given. Both minimax upper and
lower bounds are given. Confidence intervals for $\xi^T \beta$ are constructed and
shown to be minimax rate-optimal in the sparse loading regime. Finally, we
establish the possibility of adaptivity for the linear functional $\xi^T \beta$
with a sparse loading $\xi$.

3.1. Minimax length of confidence intervals for $\xi^T \beta$ in the sparse
loading regime. In this section, we focus on the sparse loading regime
defined in (2.6). The following theorem establishes the minimax rates for
the expected length of confidence intervals for $\xi^T \beta$ in the sparse
loading regime.

Theorem 1. Suppose that $0 < \alpha < \frac{1}{2}$ and $k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$ for
some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. If $\xi$ belongs to the sparse
loading regime (2.6), the minimax expected length for $(1 - \alpha)$ level
confidence intervals of $\xi^T \beta$ over $\Theta(k)$ satisfies

(3.1)   $L^*_\alpha(\Theta(k), \xi^T \beta) \asymp \|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right).$

Theorem 1 is established in two separate steps.

1. Minimax upper bound: we construct a confidence interval
$\mathrm{CI}^S_\alpha(\xi^T \beta, Z)$ such that $\mathrm{CI}^S_\alpha(\xi^T \beta, Z) \in \mathcal{I}_\alpha(\Theta(k), \xi^T \beta)$ and
for some constant $C > 0$

(3.2)   $L(\mathrm{CI}^S_\alpha(\xi^T \beta, Z), \Theta(k), \xi^T \beta) \le C \|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right).$

2. Minimax lower bound: we show that for some constant $c > 0$

(3.3)   $L^*_\alpha(\Theta(k), \xi^T \beta) \ge c \|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right).$

The minimax lower bound is implied by the adaptivity result given in
Theorem 2. We now detail the construction of a confidence interval
$\mathrm{CI}^S_\alpha(\xi^T \beta, Z)$ achieving the minimax rate (3.1) in the sparse loading
regime. The interval $\mathrm{CI}^S_\alpha(\xi^T \beta, Z)$ is centered at a de-biased scaled
Lasso estimator, which generalizes the ideas used in [25, 15, 23]. The
construction of the (random) length is different from the aforementioned
papers as the asymptotic normality result is not valid once
$k \gtrsim \frac{\sqrt{n}}{\log p}$.

Let $\{\hat\beta, \hat\sigma\}$ be the scaled Lasso estimator with tuning parameter
$\lambda_0$,

(3.4)   $\{\hat\beta, \hat\sigma\} = \arg\min_{\beta \in \mathbb{R}^p, \ \sigma \in \mathbb{R}^+} \frac{\|y - X\beta\|_2^2}{2 n \sigma} + \frac{\sigma}{2} + \lambda_0 \sum_{j=1}^p \frac{\|X_{\cdot j}\|_2}{\sqrt{n}} |\beta_j|.$

Define

(3.5)   $\hat{u} = \arg\min_{u \in \mathbb{R}^p} \left\{ u^T \hat\Sigma u : \|\hat\Sigma u - \xi\|_\infty \le \lambda_n \right\},$

where $\hat\Sigma = \frac{1}{n} X^T X$ and $\lambda_n = 12 \|\xi\|_2 M_1^2 \sqrt{\frac{\log p}{n}}$.
$\mathrm{CI}^S_\alpha(\xi^T \beta, Z)$ is centered at the following de-biased estimator

(3.6)   $\tilde\mu = \xi^T \hat\beta + \hat{u}^T \frac{1}{n} X^T (y - X \hat\beta),$

where $\hat\beta$ is the scaled Lasso estimator given in (3.4) and $\hat{u}$ is
defined in (3.5). Before specifying the length of the confidence interval,
we review the following definition of restricted eigenvalue introduced in
[2],

(3.7)   $\kappa(X, k, \alpha_0) = \min_{J_0 \subset \{1, \ldots, p\}, \ |J_0| \le k} \ \min_{\delta \ne 0, \ \|\delta_{J_0^c}\|_1 \le \alpha_0 \|\delta_{J_0}\|_1} \frac{\|X\delta\|_2}{\sqrt{n} \|\delta_{J_0}\|_2}.$

Define

(3.8)   $\rho_1(k) = \|\xi\|_2 \hat\sigma \min\left\{ 1.01 \sqrt{\frac{\hat{u}^T \hat\Sigma \hat{u}}{\|\xi\|_2^2 n}} z_{\alpha/2} + C_1(X, k) k \frac{\log p}{n}, \ \log p \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right) \right\},$

where $z_{\alpha/2}$ is the $\alpha/2$ upper quantile of the standard normal
distribution and

(3.9)   $C_1(X, k) = 7000 M_1^2 \frac{\sqrt{n}}{\min \|X_{\cdot j}\|_2} \max\left\{ 1.25, \ \frac{912 \max \|X_{\cdot j}\|_2^2}{n \kappa^2\left( X, k, 405 \left( \frac{\max \|X_{\cdot j}\|_2}{\min \|X_{\cdot j}\|_2} \right) \right)} \right\}.$

Define the event

(3.10)   $A = \{ \hat\sigma \le \log p \}.$

The confidence interval $\mathrm{CI}^S_\alpha(\xi^T \beta, Z)$ for $\xi^T \beta$ is defined as

(3.11)   $\mathrm{CI}^S_\alpha(\xi^T \beta, Z) = [\tilde\mu - \rho_1(k), \ \tilde\mu + \rho_1(k)]$ on $A$,  and  $\{0\}$ on $A^c$.

It will be shown in Section 7 that the confidence interval
$\mathrm{CI}^S_\alpha(\xi^T \beta, Z)$ has the desired coverage property and achieves the
minimax length in (3.1).

Remark 1. In the special case of $\xi = e_1$, the confidence interval defined
in (3.11) is similar to the ones based on the de-biased estimators
introduced in [25, 15, 23]. The second term $\hat{u}^T \frac{1}{n} X^T (y - X \hat\beta)$ in
(3.6) is incorporated to reduce the bias of the scaled Lasso estimator
$\hat\beta$. The constrained estimator $\hat{u}$ defined in (3.5) is a score vector
$u$ such that the variance term $u^T \hat\Sigma u$ is minimized and one component
of the bias term, $\|\hat\Sigma u - \xi\|_\infty$, is constrained by the tuning parameter
$\lambda_n$. The tuning parameter $\lambda_n$ is chosen as
$12 \|\xi\|_2 M_1^2 \sqrt{\frac{\log p}{n}}$ such that $u = \Omega \xi$ lies in the constraint set
$\|\hat\Sigma u - \xi\|_\infty \le \lambda_n$ in (3.5) with overwhelming probability. For
$C_1(X, k)$ defined in (3.9), it will be shown that it is upper bounded by a
constant with overwhelming probability.
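The mechanics of the construction can be sketched in a few lines once the ingredients are available. The sketch below assumes that an initial estimate `beta_hat`, a noise estimate `sigma_hat`, a projection direction `u_hat`, and a half-length `rho` have already been computed by some other routine; it does not solve the scaled Lasso or the constrained minimization for the projection direction. It follows the reading of (3.6), (3.10) and (3.11) used here, namely center $\xi^T \hat\beta + \hat{u}^T \frac{1}{n} X^T (y - X\hat\beta)$ and the degenerate interval $\{0\}$ when $\hat\sigma > \log p$. All function names are hypothetical.

```python
import math


def debiased_estimate(X, y, beta_hat, u_hat, xi):
    """Plug-in estimate xi^T beta_hat plus the correction u_hat^T X^T (y - X beta_hat) / n."""
    n, p = len(X), len(xi)
    residual = [y[i] - sum(X[i][j] * beta_hat[j] for j in range(p)) for i in range(n)]
    plug_in = sum(xi[j] * beta_hat[j] for j in range(p))
    correction = sum(
        sum(u_hat[j] * X[i][j] for j in range(p)) * residual[i] for i in range(n)
    ) / n
    return plug_in + correction


def sparse_loading_interval(mu_tilde, rho, sigma_hat, p):
    """Interval [mu - rho, mu + rho] on the event {sigma_hat <= log p},
    and the degenerate interval {0}, returned as (0.0, 0.0), otherwise."""
    if sigma_hat <= math.log(p):
        return mu_tilde - rho, mu_tilde + rho
    return 0.0, 0.0


if __name__ == "__main__":
    # Tiny noiseless example with X = I (n = p = 2), so X^T X / n = I / 2.
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [2.0, 3.0]
    xi = [1.0, 1.0]
    u_hat = [2.0, 2.0]          # satisfies (X^T X / n) u = xi exactly
    mu = debiased_estimate(X, y, [0.0, 0.0], u_hat, xi)
    print(mu, sparse_loading_interval(mu, rho=0.5, sigma_hat=1.0, p=100))
```

In the usage example the initial estimate is deliberately poor (all zeros), yet the correction term recovers $\xi^T \beta = 5$ because the direction `u_hat` satisfies the constraint $\hat\Sigma u = \xi$ exactly; in general the constraint holds only up to the tolerance $\lambda_n$, which leaves a residual bias.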

3.2. Adaptivity of confidence intervals for $\xi^T \beta$ in the sparse loading
regime. We have constructed a minimax rate-optimal confidence interval for
$\xi^T \beta$ in the oracle setting where the sparsity $k$ is assumed to be known.
A major drawback of the construction is that it requires prior knowledge of
$k$, which is typically unavailable in practice. An interesting question is
whether it is possible to construct adaptive confidence intervals that have
the guaranteed coverage and automatically adjust their length to $k$.

We now consider the adaptivity of the confidence intervals for $\xi^T \beta$. In
light of the minimax expected length given in Theorem 1, the following
theorem provides an answer to the adaptivity question (2.5) for the
confidence intervals for $\xi^T \beta$ in the sparse loading regime.

Theorem 2. Suppose that $0 < \alpha < \frac{1}{2}$ and $k_1 \le k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$
for some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. Then

(3.12)   $L^*_\alpha(\Theta(k_1), \Theta(k), \xi^T \beta) \ge c_1 \|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right),$

for some constant $c_1 > 0$.

Note that Theorem 2 implies the minimax lower bound in Theorem 1 by
taking $k_1 = k$. Theorem 2 rules out the possibility of rate-optimal adaptive
confidence intervals beyond the ultra-sparse region. Consider the setting
where $k_1 \ll k$ and $\frac{\sqrt{n}}{\log p} \ll k \lesssim \frac{n}{\log p}$. In this case,

$L^*_\alpha(\Theta(k_1), \Theta(k), \xi^T \beta) \asymp L^*_\alpha(\Theta(k), \xi^T \beta) \asymp \|\xi\|_2 k \frac{\log p}{n} \gg L^*_\alpha(\Theta(k_1), \xi^T \beta).$

So it is impossible to construct a confidence interval that is adaptive
simultaneously over $\Theta(k_1)$ and $\Theta(k)$ when $\frac{\sqrt{n}}{\log p} \ll k \lesssim \frac{n}{\log p}$
and $k_1 \ll k$. The only possible region for adaptation is the ultra-sparse
region $k \lesssim \frac{\sqrt{n}}{\log p}$, over which the optimal expected length of
confidence intervals is of order $\frac{1}{\sqrt{n}}$ and in particular does not
depend on the specific sparsity level. These facts are illustrated in
Figure 1.


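[Figure 1: the sparsity axis for $k$, divided into a region labeled
"Adaptive" and a region labeled "Not Adaptive".]

Fig 1. Illustration of adaptivity of confidence intervals for $\xi^T \beta$ with
a sparse loading $\xi$. For adaptation between $\Theta(k_1)$ and $\Theta(k)$ with
$k_1 \ll k$, rate-optimal adaptation is possible if $k \lesssim \frac{\sqrt{n}}{\log p}$ and
impossible otherwise.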
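The dichotomy summarized in Figure 1 can be written as a small classifier. The regions in the text are asymptotic and defined only up to constants, so the threshold constant `c` below (set to 1 by default) and the function names are assumptions of this sketch.

```python
import math


def sparsity_region(k: int, n: int, p: int, c: float = 1.0) -> str:
    """Classify k against sqrt(n)/log(p) and n/log(p), up to the assumed constant c."""
    ultra_threshold = c * math.sqrt(n) / math.log(p)
    moderate_threshold = c * n / math.log(p)
    if k <= ultra_threshold:
        return "ultra-sparse"
    if k <= moderate_threshold:
        return "moderate-sparse"
    return "beyond n/log(p)"


def adaptation_possible_sparse_loading(k: int, n: int, p: int, c: float = 1.0) -> bool:
    """Rate-optimal adaptation between Theta(k1) and Theta(k), k1 << k, is
    possible in the sparse loading regime only when k is ultra-sparse."""
    return sparsity_region(k, n, p, c) == "ultra-sparse"


if __name__ == "__main__":
    n, p = 10_000, 1_000
    for k in (10, 100, 5_000):
        print(k, sparsity_region(k, n, p), adaptation_possible_sparse_loading(k, n, p))
```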


So far the analysis is carried out within the minimax framework where
the focus is on the performance in the worst case over a large parameter
space. The minimax theory is often criticized as being too conservative. In
the following, we establish a stronger version of the non-adaptivity result
which demonstrates that the lack of adaptivity for confidence intervals is
not due to the conservativeness of the minimax framework. The result shows
that for any confidence interval $\mathrm{CI}_\alpha(\xi^T \beta, Z)$, under the coverage
constraint that $\mathrm{CI}_\alpha(\xi^T \beta, Z) \in \mathcal{I}_\alpha(\Theta(k), \xi^T \beta)$, its expected length
at any given $\theta^* = (\beta^*, I, \sigma) \in \Theta(k_1)$ must be of order
$\|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right) \sigma$. So the confidence interval must be
long at a large subset of points in the parameter space, not just at a small
number of "unlucky" points.


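Theorem 3. Suppose that $0 < \alpha < \frac{1}{2}$ and $k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$ for
some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. Let $k_1 \le (1 - \zeta_0) k - 1$ and
$q \le \frac{\zeta_0}{2} k$ for some constant $0 < \zeta_0 < 1$. Then for any
$\theta^* = (\beta^*, I, \sigma) \in \Theta(k_1)$ and $\xi \in \Xi(q, \bar{c})$,

(3.13)   $\inf_{\mathrm{CI}_\alpha(\xi^T \beta, Z) \in \mathcal{I}_\alpha(\Theta(k), \xi^T \beta)} E_{\theta^*} L(\mathrm{CI}_\alpha(\xi^T \beta, Z)) \ge c_1 \|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right) \sigma,$

for some constant $c_1 > 0$.

Note that no supremum is taken over the parameter $\theta^*$ in (3.13). Theorem
3 illustrates that if a confidence interval $\mathrm{CI}_\alpha(\xi^T \beta, Z)$ is
"superefficient" at any point $\theta^* = (\beta^*, I, \sigma) \in \Theta(k_1)$ in the sense
that

$E_{\theta^*} L(\mathrm{CI}_\alpha(\xi^T \beta, Z)) \ll \|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right) \sigma,$

then the confidence interval $\mathrm{CI}_\alpha(\xi^T \beta, Z)$ cannot have the guaranteed
coverage over the parameter space $\Theta(k)$.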


3.3. Minimax rate and adaptivity of confidence intervals for $\beta_1$. We now
turn to the special case $T(\beta) = \beta_i$, which has been the focus of several
previous papers [25, 14, 15, 23]. Without loss of generality, we consider
$\beta_1$, the first coordinate of $\beta$, in the following discussion; the results
for any other coordinate $\beta_i$ are the same. The linear functional $\beta_1$ is
the special case of a linear functional in the sparse loading regime with
$\xi = e_1$.

Theorem 1 implies that the minimax expected length for $(1 - \alpha)$ level
confidence intervals of $\beta_1$ over $\Theta(k)$ satisfies

(3.14)   $L^*_\alpha(\Theta(k), \beta_1) \asymp \frac{1}{\sqrt{n}} + k \frac{\log p}{n}.$

In the ultra-sparse region with $k \lesssim \frac{\sqrt{n}}{\log p}$, the minimax expected
length is of order $\frac{1}{\sqrt{n}}$. However, when $k$ falls in the moderate-sparse
region $\frac{\sqrt{n}}{\log p} \ll k \lesssim \frac{n}{\log p}$, the minimax expected length is of
order $k \frac{\log p}{n}$ and in this case $k \frac{\log p}{n} \gg \frac{1}{\sqrt{n}}$. Hence the
confidence intervals constructed in [25, 14, 15, 23], which are of
parametric length $\frac{1}{\sqrt{n}}$, asymptotically have coverage probability going
to 0. The condition $k \lesssim \frac{\sqrt{n}}{\log p}$ is necessary for the parametric rate
$\frac{1}{\sqrt{n}}$. [23] established asymptotic normality and asymptotic efficiency
for a de-biased estimator under the sparsity assumption
$k \ll \frac{\sqrt{n}}{\log p}$. Similar results have also been given in [19] for a
related problem of estimating a single entry of a $p$-dimensional precision
matrix based on $n$ i.i.d. samples under the same sparsity condition
$k \ll \frac{\sqrt{n}}{\log p}$. It was also shown that $k \ll \frac{\sqrt{n}}{\log p}$ is necessary for
the asymptotic normality and asymptotic efficiency results.
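The boundary between the two regions follows from comparing the two terms of the rate $\frac{1}{\sqrt{n}} + k \frac{\log p}{n}$. The short derivation below makes the comparison explicit; the numerical values of $n$, $p$ and $k$ are hypothetical and chosen only for illustration.

```latex
\[
\frac{k \,\frac{\log p}{n}}{\frac{1}{\sqrt{n}}} \;=\; \frac{k \log p}{\sqrt{n}},
\qquad\text{so}\qquad
k \,\frac{\log p}{n} \gg \frac{1}{\sqrt{n}}
\;\iff\;
k \gg \frac{\sqrt{n}}{\log p}.
\]
% Illustration (hypothetical values): n = 10^4, p = 10^3, k = 100.
\[
\frac{\sqrt{n}}{\log p} \approx \frac{100}{6.91} \approx 14.5,
\qquad
\frac{1}{\sqrt{n}} = 0.01,
\qquad
k \,\frac{\log p}{n} \approx \frac{100 \times 6.91}{10^{4}} \approx 0.069 .
\]
```

With these values the sparsity term is roughly seven times the parametric term, so an interval of length of order $\frac{1}{\sqrt{n}}$ would be too short by that factor.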

The following corollary, as a special case of Theorem 3, illustrates the
strong non-adaptivity for confidence intervals of $\beta_1$ when
$k \gg \frac{\sqrt{n}}{\log p}$.

Corollary 1. Suppose that $0 < \alpha < \frac{1}{2}$ and $k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$
for some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. Let $k_1 \le (1 - \zeta_0) k - 1$ for
some constant $0 < \zeta_0 < 1$. Then for any $\theta^* = (\beta^*, I, \sigma) \in \Theta(k_1)$,

(3.15)   $\inf_{\mathrm{CI}_\alpha(\beta_1, Z) \in \mathcal{I}_\alpha(\Theta(k), \beta_1)} E_{\theta^*} L(\mathrm{CI}_\alpha(\beta_1, Z)) \ge c_1 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right) \sigma,$

for some constant $c_1 > 0$.


4. Minimax rate and adaptivity of confidence intervals for dense
loading linear functionals. We now turn to the setting where the loading
$\xi$ is dense in the sense of (2.7). We will also briefly discuss the special
case $\sum_{i=1}^p \beta_i$ and the computationally feasible confidence intervals.


4.1. Minimax length of confidence intervals for $\xi^T \beta$ in the dense
loading regime. The following theorem establishes the minimax length of
confidence intervals of $\xi^T \beta$ in the dense loading regime (2.7).

Theorem 4. Suppose that $0 < \alpha < \frac{1}{2}$ and $k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$ for
some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. If $\xi$ belongs to the dense loading
regime (2.7), the minimax expected length for $(1 - \alpha)$ level confidence
intervals of $\xi^T \beta$ over $\Theta(k)$ satisfies

(4.1)   $L^*_\alpha(\Theta(k), \xi^T \beta) \asymp \|\xi\|_\infty k \sqrt{\frac{\log p}{n}}.$

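Note that the minimax rate in (4.1) is significantly different from the
minimax rate $\|\xi\|_2 \left( \frac{1}{\sqrt{n}} + k \frac{\log p}{n} \right)$ for the sparse loading case
given in Theorem 1. In the following, we construct a confidence interval
$\mathrm{CI}^D_\alpha(\xi^T \beta, Z)$ achieving the minimax rate (4.1) in the dense loading
regime. Define

(4.2)   $C_2(X, k) = 822 \frac{\sqrt{n}}{\min \|X_{\cdot j}\|_2} \max\left\{ 1.25, \ \frac{912 \max \|X_{\cdot j}\|_2^2}{n \kappa^2\left( X, k, 405 \left( \frac{\max \|X_{\cdot j}\|_2}{\min \|X_{\cdot j}\|_2} \right) \right)} \right\}.$

It will be shown that $C_2(X, k)$ is upper bounded by a constant with
overwhelming probability. The confidence interval $\mathrm{CI}^D_\alpha(\xi^T \beta, Z)$ is
defined to be

(4.3)   $\mathrm{CI}^D_\alpha(\xi^T \beta, Z) = [\xi^T \hat\beta - \|\xi\|_\infty \rho_2(k), \ \xi^T \hat\beta + \|\xi\|_\infty \rho_2(k)]$ on $A$,  and  $\{0\}$ on $A^c$,

where $A$ is defined in (3.10), $\hat\beta$ is the scaled Lasso estimator defined
in (3.4), and

(4.4)   $\rho_2(k) = \min\left\{ C_2(X, k) k \sqrt{\frac{\log p}{n}} \hat\sigma, \ \log p \left( k \sqrt{\frac{\log p}{n}} \hat\sigma \right) \right\}.$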

The confidence interval constructed in (4.3) will be shown to have the
desired coverage property and achieve the minimax length in (4.1). A major
difference between the construction of $\mathrm{CI}^D_\alpha(\xi^T \beta, Z)$ and that of
$\mathrm{CI}^S_\alpha(\xi^T \beta, Z)$ is that $\mathrm{CI}^D_\alpha(\xi^T \beta, Z)$ is not centered at a de-biased
estimator. If a de-biased estimator is used for the construction of
confidence intervals for $\xi^T \beta$ with a dense loading, its variance would be
too large, much larger than the optimal length $\|\xi\|_\infty k \sqrt{\frac{\log p}{n}}$.
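A back-of-the-envelope comparison suggests why de-biasing does not help with a dense loading. The sketch below assumes that the standard deviation of a de-biased estimator is of order $\frac{\|\xi\|_2}{\sqrt{n}}$, which is how the parametric term in (3.1) scales; this scaling, the unit constants, and the function names are assumptions of the illustration and are not statements taken from the text.

```python
import math


def debiased_sd_order(xi, n):
    """Assumed order of the de-biased estimator's standard deviation: ||xi||_2 / sqrt(n)."""
    return math.sqrt(sum(v * v for v in xi)) / math.sqrt(n)


def dense_optimal_length_order(xi, k, n, p):
    """Order of the minimax length in the dense loading regime:
    ||xi||_inf * k * sqrt(log(p) / n), with the constant set to 1."""
    return max(abs(v) for v in xi) * k * math.sqrt(math.log(p) / n)


if __name__ == "__main__":
    p, n, k = 20_000, 400, 5
    q = 10_000                       # dense loading: q is much larger than k^2
    xi = [1.0] * q + [0.0] * (p - q)
    print(debiased_sd_order(xi, n), dense_optimal_length_order(xi, k, n, p))
```

For a loading with $q$ entries of comparable size, $\|\xi\|_2$ grows like $\sqrt{q} \, \|\xi\|_\infty$, and under (2.7) $\sqrt{q}$ exceeds $k$ by a polynomial factor in $p$, so the first quantity dominates the second.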

4.2. Adaptivity of confidence intervals for $\xi^T \beta$ in the dense loading
regime. In this section, we investigate the possibility of adaptive
confidence intervals for $\xi^T \beta$ in the dense loading regime. The following
theorem leads directly to an answer to the adaptivity question (2.5) for
confidence intervals for $\xi^T \beta$ in the dense loading regime.

Theorem 5. Suppose that $0 < \alpha < \frac{1}{2}$ and $k_1 \le k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$
for some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. Then, for some constant
$c_1 > 0$,

(4.5)   $L^*_\alpha(\Theta(k_1), \Theta(k), \xi^T \beta) \ge c_1 \|\xi\|_\infty k \sqrt{\frac{\log p}{n}}.$

Theorem 5 implies the minimax lower bound in Theorem 4 by taking
$k_1 = k$. If $k_1 \ll k$, (4.5) implies

(4.6)   $L^*_\alpha(\Theta(k_1), \Theta(k), \xi^T \beta) \ge c_1 \|\xi\|_\infty k \sqrt{\frac{\log p}{n}} \gg L^*_\alpha(\Theta(k_1), \xi^T \beta),$

which shows that rate-optimal adaptation over two different sparsity levels
$k_1$ and $k$ is not possible at all for any $k_1 \ll k$. In contrast, in the
case of the sparse loading regime, Theorem 2 shows that it is possible to
construct an adaptive confidence interval in the ultra-sparse region
$k \lesssim \frac{\sqrt{n}}{\log p}$, although adaptation is not possible in the
moderate-sparse region $\frac{\sqrt{n}}{\log p} \ll k \lesssim \frac{n}{\log p}$.

Similarly to Theorem 3, the following theorem establishes the strong
non-adaptivity results for $\xi^T \beta$ in the dense loading regime.

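Theorem 6. Suppose that $0 < \alpha < \frac{1}{2}$ and $k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$ for
some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. Let $q$ satisfy (2.7) and
$k_1 \le (1 - \zeta_0) k - 1$ for some constant $0 < \zeta_0 < 1$. Then for any
$\theta^* = (\beta^*, I, \sigma) \in \Theta(k_1)$ and $\xi \in \Xi(q, \bar{c})$, there is some constant
$c_1 > 0$ such that

(4.7)   $\inf_{\mathrm{CI}_\alpha(\xi^T \beta, Z) \in \mathcal{I}_\alpha(\Theta(k), \xi^T \beta)} E_{\theta^*} L(\mathrm{CI}_\alpha(\xi^T \beta, Z)) \ge c_1 \|\xi\|_\infty k \sqrt{\frac{\log p}{n}} \sigma.$

4.3. Minimax length and adaptivity of confidence intervals for
$\sum_{i=1}^p \beta_i$. We now turn to the special case of $T(\beta) = \sum_{i=1}^p \beta_i$, the
sum of all coefficients. Theorem 4 implies that the minimax expected length
for $(1 - \alpha)$ level confidence intervals of $\sum_{i=1}^p \beta_i$ over $\Theta(k)$
satisfies

(4.8)   $L^*_\alpha\left( \Theta(k), \sum_{i=1}^p \beta_i \right) \asymp k \sqrt{\frac{\log p}{n}}.$

The following impossibility of adaptivity result for confidence intervals
for $\sum_{i=1}^p \beta_i$ is a special case of Theorem 6.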


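Corollary 2. Suppose that $0 < \alpha < \frac{1}{2}$ and $k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$
for some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. Let $k_1 \le (1 - \zeta_0) k - 1$ for
some constant $0 < \zeta_0 < 1$. Then for any $\theta^* = (\beta^*, I, \sigma) \in \Theta(k_1)$,

(4.9)   $\inf_{\mathrm{CI}_\alpha(\sum \beta_i, Z) \in \mathcal{I}_\alpha(\Theta(k), \sum \beta_i)} E_{\theta^*} L\left( \mathrm{CI}_\alpha\left( \sum_{i=1}^p \beta_i, Z \right) \right) \ge c_1 k \sqrt{\frac{\log p}{n}} \sigma,$

for some constant $c_1 > 0$.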


Remark 2. In the Gaussian sequence model, the problem of estimating
the sum of sparse means has been considered in [5, 7] and more recently
in [12]. In particular, the minimax rate is given in [5] and [12]. The
problem of constructing minimax confidence intervals for the sum of sparse
normal means was studied in [6].


4.4. Computationally feasible confidence intervals. A major drawback of
the minimax rate-optimal confidence intervals $\mathrm{CI}^S_\alpha(\xi^T \beta, Z)$ given in
(3.11) and $\mathrm{CI}^D_\alpha(\xi^T \beta, Z)$ given in (4.3) is that they are not
computationally feasible, as both depend on the restricted eigenvalue
$\kappa(X, k, \alpha_0)$, which is difficult to evaluate. In this section, we assume
the prior knowledge of the sparsity $k$ and discuss how to construct a
computationally feasible confidence interval.

The main idea is to replace the term involving the restricted eigenvalue
by a computationally feasible lower bound function $\omega(\Omega, X, k)$ defined by


(4.10) uj{n,X,k) 


ydY^Amax (f^) 


9(1 + 405 


max ||V.j||2 \ 
min||A.jj|2 J 


min m 



The lower bound relation is established by Lemma 13 in the supplement [3],
which is based on the concentration inequality for Gaussian design in [18].
Except for $\lambda_{\min}(\Omega)$ and $\lambda_{\max}(\Omega)$, all terms in (4.10) are based on the data
$(X, y)$ and the prior knowledge of $k$. To construct a data-dependent
computationally feasible confidence interval, we make the following assumption,

(4.11) $\sup_{\Omega \in \mathcal{G}_{\Omega}} P_X\left(\max\left\{\left|\hat{\lambda}_{\min}(\Omega) - \lambda_{\min}(\Omega)\right|, \left|\hat{\lambda}_{\max}(\Omega) - \lambda_{\max}(\Omega)\right|\right\} \ge C a_{n,p}\right) = o(1),$

where $\limsup a_{n,p} = 0$, $\mathcal{G}_{\Omega}$ is a pre-specified parameter space for $\Omega$, and
$P_X$ denotes the probability distribution with respect to $X$.


Remark 3. We assume $\mathcal{G}_{\Omega}$ is a subset of the class of precision matrices
defined in (2.4), $\{\Omega : \frac{1}{M_1} \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) \le M_1\}$. By assuming $\mathcal{G}_{\Omega}$ is
a set of precision matrices with special structure, we can find estimators
satisfying (4.11). If $\mathcal{G}_{\Omega}$ is assumed to be the set of sparse precision
matrices, we can estimate the precision matrix $\Omega$ by the CLIME estimator $\hat{\Omega}$
proposed in [4]. Under a proper sparsity assumption on $\Omega$, the plug-in estimator
$(\lambda_{\min}(\hat{\Omega}), \lambda_{\max}(\hat{\Omega}))$ satisfies (4.11). Other special structures can also be
assumed, for example, that the covariance matrix is sparse. In that case we can use
the plug-in estimator based on the estimator proposed in [10].


With Amin (f^) and Amax (f^)) we define io (R, X, k) as 


uj{n,X,k) 


4 y Amax (f^) 


9 fl 

y min||A.j||2 


'^min (R) 



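and construct computationally feasible confidence intervals by replacing
the restricted eigenvalue term $\kappa\left(X, k, 405\,\frac{\max \|X_{\cdot j}\|_2}{\min \|X_{\cdot j}\|_2}\right)$ with this plug-in version of $\omega(\Omega, X, k)$.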


5. Confidence intervals for linear functionals with prior knowledge
$\Omega = I$ and $\sigma = \sigma_0$. We have so far focused on the setting where
both the precision matrix $\Omega$ and the noise level $\sigma$ are unknown, which is
the case in most statistical applications. It is still of theoretical interest to
study the problem when $\Omega$ and $\sigma$ are known, and to contrast the
results with those for unknown $\Omega$ and $\sigma$. In this case, we consider
the setting where it is known a priori that $\Omega = I$ and $\sigma = \sigma_0$ and specify the
parameter space as


(5.1) $\Theta(k, I, \sigma_0) = \{\theta = (\beta, I, \sigma_0) : \|\beta\|_0 \le k\}.$


We will discuss separately the minimax rates and adaptivity of confidence
intervals for the linear functionals in the sparse loading regime and dense
loading regime over the parameter space $\Theta(k, I, \sigma_0)$.
5.1. Confidence intervals for linear functionals in the sparse loading regime.
The following theorem establishes the minimax rate of confidence intervals
for linear functionals in the sparse loading regime when there is prior
knowledge that $\Omega = I$ and $\sigma = \sigma_0$.

Theorem 7. Suppose that $0 < \alpha < \frac{1}{2}$ and $k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$ for some
constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$. If $\xi$ belongs to the sparse loading regime
(2.6), the minimax expected length for $(1 - \alpha)$ level confidence intervals of
$\xi^{\top}\beta$ over $\Theta(k, I, \sigma_0)$ satisfies

(5.2) $L^{*}_{\alpha}\left(\Theta(k, I, \sigma_0), \xi^{\top}\beta\right) \asymp \frac{\|\xi\|_2}{\sqrt{n}}.$


Compared with the minimax rate for the unknown $\Omega$ and $\sigma$ case given in
Theorem 1, the minimax rate in (5.2) is significantly different.
With the prior knowledge of $\Omega = I$ and $\sigma = \sigma_0$, the above theorem shows
that the minimax expected length of confidence intervals for $\xi^{\top}\beta$ is always of
the parametric rate and in particular does not depend on the sparsity parameter
$k$. In this case, adaptive confidence intervals for $\xi^{\top}\beta$ are possible over the full
range $k \lesssim \frac{n}{\log p}$. A similar result for confidence intervals covering all $\beta_i$
has been given in a recent paper [16]. The focus of [16] is on individual
coordinates, not general linear functionals.

The minimax lower bound of Theorem 7 follows from the parametric
lower bound of Theorem 1. As both $\Omega$ and $\sigma$ are known, the upper bound
analysis is easier than in the unknown $\Omega$ and $\sigma$ case and is similar to the one
given in [16]. For completeness, we detail the construction of a confidence
interval achieving the minimax length in (5.2) using the de-biasing method.
We first randomly split the samples $(X, y)$ into two subsamples $(X^{(1)}, y^{(1)})$
and $(X^{(2)}, y^{(2)})$ with sample sizes $n_1$ and $n_2$, respectively. Without loss of
generality, we assume that $n$ is even and $n_1 = n_2 = \frac{n}{2}$. Let $\hat{\beta}$ denote the
Lasso estimator defined based on the sample $(X^{(1)}, y^{(1)})$ with the proper
tuning parameter $\lambda$,

(5.3) $\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p}} \left\{ \frac{\|y^{(1)} - X^{(1)}\beta\|_2^2}{2 n_1} + \lambda \sum_{j=1}^{p} \frac{\|X^{(1)}_{\cdot j}\|_2}{\sqrt{n_1}} |\beta_j| \right\}.$

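We define the following estimator of $\xi^{\top}\beta$,

(5.4) $\tilde{\mu} = \xi^{\top}\hat{\beta} + \frac{1}{n_2}\, \xi^{\top} (X^{(2)})^{\top} \left(y^{(2)} - X^{(2)}\hat{\beta}\right).$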
Based on the estimator, we construct the following confidence interval
(5.5) ClL(eT/3,Z) = 


- 1 ni 11^112 - I 1 m 11^112 

+ 1-01 , - Zao/2(^0 






where $\alpha_0 = \gamma_0 \alpha$ with $0 < \gamma_0 < 1$. It will be shown in the supplement [3] that
the confidence interval proposed in (5.5) has valid coverage and achieves the
minimax length in (5.2).
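The construction above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the tuning level `lam`, the coordinate-descent solver, the form of the correction term (the direction $\xi$, since $\Omega = I$), and the half-width $1.01\,\frac{\|\xi\|_2}{\sqrt{n_2}}\, z_{\alpha_0/2}\, \sigma_0$ are our reading of (5.3) to (5.5), and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def weighted_lasso(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n))||y - Xb||^2 + lam * sum_j (||X_j||_2/sqrt(n)) |b_j|."""
    n, p = len(X), len(X[0])
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, updated in place
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes coordinate j.
            z = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(z, lam * math.sqrt(col_sq[j])) / col_sq[j]
            if new != beta[j]:
                step = new - beta[j]
                for i in range(n):
                    resid[i] -= X[i][j] * step
                beta[j] = new
    return beta


def debiased_interval(X, y, xi, sigma0, alpha=0.05, gamma0=0.5, lam=None):
    """Sample-splitting de-biased interval for xi^T beta when Omega = I and sigma = sigma0."""
    n, p = len(X), len(xi)
    n1 = n // 2
    n2 = n - n1
    # Rows are i.i.d., so taking the first half stands in for a random split.
    X1, y1, X2, y2 = X[:n1], y[:n1], X[n1:], y[n1:]
    if lam is None:
        lam = sigma0 * math.sqrt(2.0 * math.log(p) / n1)  # assumed tuning level
    beta_hat = weighted_lasso(X1, y1, lam)
    resid2 = [y2[i] - sum(X2[i][j] * beta_hat[j] for j in range(p)) for i in range(n2)]
    # Correction term xi^T X2^T (y2 - X2 beta_hat) / n2.
    correction = sum(
        sum(X2[i][j] * xi[j] for j in range(p)) * resid2[i] for i in range(n2)
    ) / n2
    center = sum(xi[j] * beta_hat[j] for j in range(p)) + correction
    z = NormalDist().inv_cdf(1.0 - gamma0 * alpha / 2.0)
    half = 1.01 * math.sqrt(sum(v * v for v in xi)) / math.sqrt(n2) * z * sigma0
    return center - half, center + half


random.seed(0)
n, p, sigma0 = 200, 30, 1.0
beta = [2.0, -1.5] + [0.0] * (p - 2)
xi = [1.0, 1.0] + [0.0] * (p - 2)
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(X[i][j] * beta[j] for j in range(p)) + sigma0 * random.gauss(0.0, 1.0)
     for i in range(n)]
lower, upper = debiased_interval(X, y, xi, sigma0)
print(lower, upper)
```

Note that the length of the interval depends only on $\|\xi\|_2$, $n_2$, $\sigma_0$ and $\alpha_0$, and not on the sparsity $k$, which is what makes adaptation possible in this regime.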


5.2. Confidence intervals for linear functionals in the dense loading regime.
In marked contrast to the sparse loading regime, the prior knowledge of
$\Omega = I$ and $\sigma = \sigma_0$ does not improve the minimax rate in the dense loading
regime. That is, Theorem 4 remains true by replacing $\Theta(k)$ and $\Theta(k_1)$ with
$\Theta(k, I, \sigma_0)$ and $\Theta(k_1, I, \sigma_0)$, respectively. However, the cost of adaptation
changes when there is prior knowledge of $\Omega = I$ and $\sigma = \sigma_0$. The following
theorem establishes the adaptivity lower bound in the dense loading regime.

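Theorem 8. Suppose that $0 < \alpha < \frac{1}{2}$ and $k_1 \le k \le c \min\{p^{\gamma}, \frac{n}{\log p}\}$ for
some constants $c > 0$ and $0 \le \gamma < \frac{1}{2}$; then, for some constant $c_1 > 0$,

(5.6) $L^{*}_{\alpha}\left(\Theta(k_1, I, \sigma_0), \Theta(k, I, \sigma_0), \xi^{\top}\beta\right) \ge c_1 \|\xi\|_{\infty} \sigma_0 \max\left\{ k_1 \sqrt{\frac{\log p}{n}},\ \min\left\{ k \sqrt{\frac{\log p}{n}},\ \frac{\sqrt{k}}{n^{1/4}} \right\} \right\}.$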

The lower bound in (5.6) is attainable. For reasons of space, the
construction is omitted here. Under the framework (2.5), adaptive confidence
intervals are still impossible, since for $k_1 \ll k$,

$L^{*}_{\alpha}\left(\Theta(k_1, I, \sigma_0), \Theta(k, I, \sigma_0), \xi^{\top}\beta\right) \gg L^{*}_{\alpha}\left(\Theta(k_1, I, \sigma_0), \xi^{\top}\beta\right).$

Compared with Theorem 5, we observe that the cost of adaptation is reduced
with the prior knowledge of $\Omega = I$ and $\sigma = \sigma_0$.

6. Discussion. In the present paper we studied the minimaxity and
adaptivity of confidence intervals for general linear functionals $\xi^{\top}\beta$ with a
sparse or dense loading $\xi$ for the setting where $\Omega$ and $\sigma$ are unknown as
well as the setting with the prior knowledge of $\Omega = I$ and $\sigma = \sigma_0$. In the
more typical case in practice where $\Omega$ and $\sigma$ are unknown, the adaptivity
results are quite negative: with the exception of the ultra-sparse region
for confidence intervals for $\xi^{\top}\beta$ with a sparse loading $\xi$, it is necessary to
know the true sparsity $k$ in order to have guaranteed coverage probability
and rate-optimal expected length. In contrast to estimation, knowledge of
the sparsity $k$ is crucial to constructing honest confidence intervals. In this
sense, the problem of constructing confidence intervals is much harder than
the estimation problem.

The case of known $\Omega = I$ and $\sigma = \sigma_0$ is strikingly different. The minimax
expected length in the sparse loading regime is of order $\frac{\|\xi\|_2}{\sqrt{n}}$ and in particular
does not depend on $k$, and adaptivity can be achieved over the full range of
sparsity $k \lesssim \frac{n}{\log p}$. So in this case, the knowledge of $\Omega$ and $\sigma$ is very useful.
On the other hand, in the dense loading regime the information on $\Omega$ and
$\sigma$ is of limited use. In this case, the minimax rate and lack of adaptivity
remain unchanged, compared with the unknown $\Omega$ and $\sigma$ case, although the
cost of adaptation is reduced.

Regarding the construction of confidence intervals, there is a significant
difference between the sparse and dense loading regimes. The de-biasing
method is useful in the sparse loading regime since such a procedure reduces
the bias but does not dramatically increase the variance. However, the
de-biasing construction is not applicable to the dense loading regime since the
cost of obtaining a near-unbiased estimator is to significantly increase the
variance, which would lead to an unnecessarily long confidence interval. An
interesting open problem is the construction of a confidence interval for $\xi^{\top}\beta$
achieving the minimax length where the sparsity $q$ of the loading $\xi$ is in the
middle regime with $c p^{\gamma} \le q \le c p^{2\gamma + \varsigma}$ for some $0 < \varsigma < 1 - 2\gamma$.

In addition to constructing confidence intervals for linear functionals,
another interesting problem is constructing confidence balls for the whole
vector $\beta$. Such a problem has been considered in [17], where the authors established the
impossibility of adaptive confidence balls for sparse linear regression. These
problems are connected, but each has its own special features and the
behaviors of the problems are different from each other. The connections and
differences in adaptivity among various forms of confidence sets have also
been observed in nonparametric function estimation problems. See, for
example, [6] for adaptive confidence intervals for linear functionals, [13, 9] for
adaptive confidence bands, and [8, 20] for adaptive confidence balls.

In the context of nonparametric function estimation, a general adaptation
theory for confidence intervals for an arbitrary linear functional was
developed in Cai and Low [6] over a collection of convex parameter spaces.
It was shown that the key quantity that determines adaptivity is a geometric
quantity called the between-class modulus of continuity. The convexity
assumption on the parameter space in Cai and Low [6] is crucial for the
adaptation theory. In high-dimensional linear regression, the parameter space is
highly non-convex. The adaptation theory developed in [6] does not apply
to the present setting of high-dimensional linear regression. It would be of
significant interest to develop a general adaptation theory for confidence
intervals in such a non-convex setting.


7. Proofs. In this section, we prove two main results, Theorem 3 and
the minimax upper bound of Theorem 1. For reasons of space, the proofs of the
other results are given in the supplement [3].

A key technical tool for the proof of the lower bound results is the
following lemma, which establishes the adaptivity over two nested parameter
spaces. Such a formulation has been considered in [6] in the context of
adaptive confidence intervals over convex parameter spaces under the Gaussian
sequence model. However, the parameter space $\Theta(k)$ considered in the
high-dimensional setting is highly non-convex. The following lemma can be viewed
as a generalization of [6] to the non-convex parameter space, where the lower
bound argument requires testing for composite hypotheses.

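Suppose that we observe a random variable $Z$ which has a distribution
$P_{\theta}$, where the parameter $\theta$ belongs to the parameter space $\mathcal{H}$. Let $\mathrm{CI}_{\alpha}(T, Z)$
be the confidence interval for the linear functional $T(\theta)$ with the guaranteed
coverage $1 - \alpha$ over the parameter space $\mathcal{H}$. Let $\mathcal{H}_0$ and $\mathcal{H}_1$ be subsets of the
parameter space $\mathcal{H}$ where $\mathcal{H} = \mathcal{H}_0 \cup \mathcal{H}_1$. Let $\pi_{\mathcal{H}_i}$ denote the prior distribution
supported on the parameter space $\mathcal{H}_i$ for $i = 0, 1$. Let $f_{\pi_{\mathcal{H}_i}}(z)$ denote the
density function of the marginal distribution of $Z$ with the prior $\pi_{\mathcal{H}_i}$ on $\mathcal{H}_i$
for $i = 0, 1$. More specifically, $f_{\pi_{\mathcal{H}_i}}(z) = \int f_{\theta}(z)\, \pi_{\mathcal{H}_i}(\theta)\, d\theta$.

Denote by $P_{\pi_{\mathcal{H}_i}}$ the marginal distribution of $Z$ with the prior $\pi_{\mathcal{H}_i}$ on $\mathcal{H}_i$
for $i = 0, 1$. For any function $g$, we write $E_{\pi_{\mathcal{H}_0}}(g(Z))$ for the expectation of
$g(Z)$ with respect to the marginal distribution of $Z$ with the prior $\pi_{\mathcal{H}_0}$ on
$\mathcal{H}_0$. We define the $\chi^2$ distance between two density functions $f_1$ and $f_0$ by


(7.1) $\chi^2(f_1, f_0) = \int \frac{(f_1(z) - f_0(z))^2}{f_0(z)}\, dz = \int \frac{f_1^2(z)}{f_0(z)}\, dz - 1$


and the total variation distance by $\mathrm{TV}(f_1, f_0) = \int |f_1(z) - f_0(z)|\, dz$. It is
well known that


(7.2) $\mathrm{TV}(f_1, f_0) \le \sqrt{\chi^2(f_1, f_0)}.$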


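Lemma 1. Assume $T(\theta) = \mu_0$ for $\theta \in \mathcal{H}_0$ and $T(\theta) = \mu_1$ for $\theta \in \mathcal{H}_1$
and $\mathcal{H} = \mathcal{H}_0 \cup \mathcal{H}_1$. For any $\mathrm{CI}_{\alpha}(T, Z) \in \mathcal{I}_{\alpha}(T, \mathcal{H})$,

(7.3) $L\left(\mathrm{CI}_{\alpha}(T, Z), \mathcal{H}\right) \ge L\left(\mathrm{CI}_{\alpha}(T, Z), \mathcal{H}_0\right) \ge |\mu_1 - \mu_0| \left(1 - 2\alpha - \mathrm{TV}\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right)\right)_{+}.$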
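The role of (7.1), (7.2) and the bound in Lemma 1 can be checked numerically on a finite sample space. The sketch below is only an illustration: the two probability vectors, the functional values $\mu_0 = 0$, $\mu_1 = 1$ and the level $\alpha = 0.05$ are hypothetical, and the function names are ours. Following the text, the total variation distance is taken to be the $L_1$ distance without the factor $\frac{1}{2}$.

```python
import math


def chi2_distance(f1, f0):
    """chi^2 distance as in (7.1): sum of (f1 - f0)^2 / f0."""
    return sum((a - b) ** 2 / b for a, b in zip(f1, f0))


def tv_distance(f1, f0):
    """TV distance as defined in the text: sum of |f1 - f0| (no factor 1/2)."""
    return sum(abs(a - b) for a, b in zip(f1, f0))


def lemma1_lower_bound(mu1, mu0, alpha, tv):
    """Lower bound of Lemma 1: |mu1 - mu0| * (1 - 2*alpha - TV)_+."""
    return abs(mu1 - mu0) * max(0.0, 1.0 - 2.0 * alpha - tv)


# Hypothetical null and (mixture) alternative on a four-point sample space.
f0 = [0.25, 0.25, 0.25, 0.25]
f1 = [0.30, 0.20, 0.28, 0.22]

tv = tv_distance(f1, f0)
chi2 = chi2_distance(f1, f0)
bound = lemma1_lower_bound(mu1=1.0, mu0=0.0, alpha=0.05, tv=tv)

# Second form of (7.1): sum of f1^2 / f0 minus 1.
chi2_alt = sum(a * a / b for a, b in zip(f1, f0)) - 1.0
print(tv, math.sqrt(chi2), bound)
```

When the two marginals are hard to distinguish (small $\chi^2$, hence small TV), the bound stays close to $(1 - 2\alpha)|\mu_1 - \mu_0|$, so any interval with coverage over both hypotheses must be long even under the null.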
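7.1. Proof of Lemma 1. The supremum risk over $\mathcal{H}_0$ is lower bounded
by the Bayesian risk with the prior $\pi_{\mathcal{H}_0}$ on $\mathcal{H}_0$,

(7.4) $\sup_{\theta \in \mathcal{H}_0} E_{\theta} L\left(\mathrm{CI}_{\alpha}(T, Z)\right) \ge \int_{\theta} E_{\theta} L\left(\mathrm{CI}_{\alpha}(T, Z)\right) \pi_{\mathcal{H}_0}(\theta)\, d\theta = E_{\pi_{\mathcal{H}_0}} L\left(\mathrm{CI}_{\alpha}(T, Z)\right).$

By the definition of $\mathrm{CI}_{\alpha}(T, Z) \in \mathcal{I}_{\alpha}(T, \mathcal{H})$, we have

(7.5) $P_{\pi_{\mathcal{H}_i}}\left(\mu_i \in \mathrm{CI}_{\alpha}(T, Z)\right) = \int_{\theta} P_{\theta}\left(\mu_i \in \mathrm{CI}_{\alpha}(T, Z)\right) \pi_{\mathcal{H}_i}(\theta)\, d\theta \ge 1 - \alpha,$

for $i = 0, 1$. By the following inequality

$\left| P_{\pi_{\mathcal{H}_1}}\left(\mu_1 \in \mathrm{CI}_{\alpha}(T, Z)\right) - P_{\pi_{\mathcal{H}_0}}\left(\mu_1 \in \mathrm{CI}_{\alpha}(T, Z)\right) \right| \le \mathrm{TV}\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right),$

we have $P_{\pi_{\mathcal{H}_0}}\left(\mu_1 \in \mathrm{CI}_{\alpha}(T, Z)\right) \ge 1 - \alpha - \mathrm{TV}\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right)$. This together
with (7.5) yields $P_{\pi_{\mathcal{H}_0}}\left(\mu_0, \mu_1 \in \mathrm{CI}_{\alpha}(T, Z)\right) \ge 1 - 2\alpha - \mathrm{TV}\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right)$, which
leads to $P_{\pi_{\mathcal{H}_0}}\left(L\left(\mathrm{CI}_{\alpha}(T, Z)\right) \ge |\mu_1 - \mu_0|\right) \ge 1 - 2\alpha - \mathrm{TV}\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right)$. Hence,
$E_{\pi_{\mathcal{H}_0}} L\left(\mathrm{CI}_{\alpha}(T, Z)\right) \ge |\mu_1 - \mu_0| \left(1 - 2\alpha - \mathrm{TV}\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right)\right)_{+}$. The lower bound
(7.3) follows from inequality (7.4).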


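7.2. Proof of Theorem 3. The lower bound in (3.13) involves a
parametric term and a non-parametric term. The proof of the parametric
lower bound is postponed to the supplement. In the following, we will prove
the non-parametric lower bound


(7.6) $\inf_{\mathrm{CI}_{\alpha}(\xi^{\top}\beta, Z) \in \mathcal{I}_{\alpha}(\Theta(k), \xi^{\top}\beta)} E_{\theta^*} L\left(\mathrm{CI}_{\alpha}(\xi^{\top}\beta, Z)\right) \ge c_1 \|\xi\|_2\, k\, \frac{\log p}{n}\, \sigma,$


for some constant $c_1 > 0$. Without loss of generality, we assume $\mathrm{supp}(\xi) =
\{1, \cdots, \|\xi\|_0\}$. We generate the orthogonal matrix $M \in \mathbb{R}^{\|\xi\|_0 \times \|\xi\|_0}$ such
that its first row is $\frac{1}{\|\xi\|_2}\, \xi_{\mathrm{supp}(\xi)}^{\top}$ and define the orthogonal matrix $Q$ as
$Q = \begin{pmatrix} M & 0 \\ 0 & I \end{pmatrix}$. We transform both the design matrix $X$ and the regression
vector $\beta$ and view the linear model (2.1) as $y = V\psi + \epsilon$, where $V = XQ^{\top}$
and $\psi = Q\beta$. The transformed vector $\psi^* = Q\beta^* = \begin{pmatrix} M\beta^*_{\mathrm{supp}(\xi)} \\ \beta^*_{-\mathrm{supp}(\xi)} \end{pmatrix}$
is of sparsity at most $\|\xi\|_0 + k_1$. The first coefficient $\psi^*_1$ of $\psi^*$ is $\frac{1}{\|\xi\|_2}\, \xi^{\top}\beta^*$. The
covariance matrix $\Psi$ of $V_{i\cdot}$ is $Q\Sigma Q^{\top}$ and its corresponding precision matrix
is $\Gamma = Q\Omega Q^{\top}$. To represent the transformed observed data and parameter,
we abuse the notation slightly and also use $Z_i = (y_i, V_{i\cdot})$ and $\theta^* = (\psi^*, I, \sigma)$.
We define the parameter space $G(k)$ of $(\psi, \Gamma, \sigma)$ as


(7.7) $G(k) = \left\{(\psi, \Gamma, \sigma) : \psi = Q\beta,\ \Gamma = Q\Omega Q^{\top} \text{ for } (\beta, \Omega, \sigma) \in \Theta(k)\right\}.$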
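The rotation used above can be made explicit. The sketch below builds an orthogonal matrix whose first row is $\xi_{\mathrm{supp}(\xi)} / \|\xi\|_2$ by a Householder reflection (one of many possible choices; the proof only needs some orthogonal $M$ with this first row) and checks that the first coordinate of $\psi = Q\beta$ equals $\xi^{\top}\beta / \|\xi\|_2$. The helper name and the numerical vectors are hypothetical.

```python
import math


def householder_with_first_row(u):
    """Return an orthogonal (symmetric) matrix whose first row is the unit vector u.

    Uses the reflection H = I - 2 w w^T / (w^T w) with w = e_1 - u, which maps e_1 to u.
    """
    m = len(u)
    w = [(1.0 if i == 0 else 0.0) - u[i] for i in range(m)]
    ww = sum(x * x for x in w)
    if ww < 1e-24:  # u is already e_1, so the identity works
        return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    return [
        [(1.0 if i == j else 0.0) - 2.0 * w[i] * w[j] / ww for j in range(m)]
        for i in range(m)
    ]


# Hypothetical loading with support {1, 2} and a hypothetical coefficient vector.
xi = [3.0, 4.0, 0.0, 0.0, 0.0]
beta = [1.0, 0.0, 2.0, 0.0, -1.0]
q = 2

norm_xi = math.sqrt(sum(v * v for v in xi))
M = householder_with_first_row([v / norm_xi for v in xi[:q]])

# psi = Q beta with Q = blockdiag(M, I): rotate the support block, keep the rest.
psi = [sum(M[i][j] * beta[j] for j in range(q)) for i in range(q)] + beta[q:]

target = sum(a * b for a, b in zip(xi, beta)) / norm_xi
print(psi[0], target)
```

Because $Q$ is orthogonal, $\|\psi\|_2 = \|\beta\|_2$, while the sparsity of $\psi$ can exceed that of $\beta$ by at most $\|\xi\|_0$, which is why the bound $\|\xi\|_0 + k_1$ appears in the text.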
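For a given $Q$, there exists a bijective mapping between $\Theta(k)$ and $G(k)$. To
show that $(\psi, \Gamma, \sigma) \in G(k)$, it is equivalent to show $(Q^{\top}\psi, Q^{\top}\Gamma Q, \sigma) \in \Theta(k)$.

Let $\mathcal{I}_{\alpha}(G(k), \psi_1)$ denote the set of confidence intervals for $\psi_1 = \frac{1}{\|\xi\|_2}\, \xi^{\top}\beta$
with guaranteed coverage over $G(k)$. If $\mathrm{CI}_{\alpha}(\psi_1, Z) \in \mathcal{I}_{\alpha}(G(k), \psi_1)$, then
$\|\xi\|_2\, \mathrm{CI}_{\alpha}(\psi_1, Z) \in \mathcal{I}_{\alpha}(\Theta(k), \xi^{\top}\beta)$; if $\mathrm{CI}_{\alpha}(\xi^{\top}\beta, Z) \in \mathcal{I}_{\alpha}(\Theta(k), \xi^{\top}\beta)$, then
$\frac{1}{\|\xi\|_2}\, \mathrm{CI}_{\alpha}(\xi^{\top}\beta, Z) \in \mathcal{I}_{\alpha}(G(k), \psi_1)$. Because of such one-to-one
correspondence, we have

(7.8) $\inf_{\mathrm{CI}_{\alpha}(\xi^{\top}\beta, Z) \in \mathcal{I}_{\alpha}(\Theta(k), \xi^{\top}\beta)} E_{\theta^*} L\left(\mathrm{CI}_{\alpha}(\xi^{\top}\beta, Z)\right) = \|\xi\|_2 \inf_{\mathrm{CI}_{\alpha}(\psi_1, Z) \in \mathcal{I}_{\alpha}(G(k), \psi_1)} E_{\theta^*} L\left(\mathrm{CI}_{\alpha}(\psi_1, Z)\right).$

By (7.6) and (7.8), we reduce the problem to

(7.9) $\inf_{\mathrm{CI}_{\alpha}(\psi_1, Z) \in \mathcal{I}_{\alpha}(G(k), \psi_1)} E_{\theta^*} L\left(\mathrm{CI}_{\alpha}(\psi_1, Z)\right) \ge c_1 k\, \frac{\log p}{n}\, \sigma.$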


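Under the Gaussian random design model, $Z_i = (y_i, V_{i\cdot}) \in \mathbb{R}^{p+1}$ follows
a joint Gaussian distribution with mean 0. Let $\Sigma^z$ denote the covariance
matrix of $Z_i$. Decompose $\Sigma^z$ into blocks $\begin{pmatrix} \Sigma^z_{yy} & (\Sigma^z_{vy})^{\top} \\ \Sigma^z_{vy} & \Sigma^z_{vv} \end{pmatrix}$, where $\Sigma^z_{yy}$, $\Sigma^z_{vv}$
and $\Sigma^z_{vy}$ denote the variance of $y$, the variance of $V$ and the covariance
of $y$ and $V$, respectively. We define the function $h : \Sigma^z \to (\psi, \Gamma, \sigma)$ as

$h(\Sigma^z) = \left( (\Sigma^z_{vv})^{-1} \Sigma^z_{vy},\ (\Sigma^z_{vv})^{-1},\ \sqrt{\Sigma^z_{yy} - (\Sigma^z_{vy})^{\top} (\Sigma^z_{vv})^{-1} \Sigma^z_{vy}} \right).$

The function $h$ is bijective and its inverse mapping $h^{-1} : (\psi, \Gamma, \sigma) \to \Sigma^z$ is

$h^{-1}\left((\psi, \Gamma, \sigma)\right) = \begin{pmatrix} \psi^{\top}\Gamma^{-1}\psi + \sigma^2 & \psi^{\top}\Gamma^{-1} \\ \Gamma^{-1}\psi & \Gamma^{-1} \end{pmatrix}.$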


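The null space is taken as $\mathcal{H}_0 = \{(\psi^*, I, \sigma)\}$ and $\pi_{\mathcal{H}_0}$ denotes the point
mass prior at this point. The proof is divided into three steps:

1. Construct $\mathcal{H}_1$ and show that $\mathcal{H}_1 \subset G(k)$;

2. Control the distribution distance $\mathrm{TV}\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right)$;

3. Calculate the distance $|\mu_1 - \mu_0|$ where $\mu_0 = \psi^*_1$ and $\mu_1 = \psi_1$ with
$(\psi, \Gamma, \sigma) \in \mathcal{H}_1$. We show that $\mu_1 = \psi_1$ is a fixed constant for all
$(\psi, \Gamma, \sigma) \in \mathcal{H}_1$ and then apply Lemma 1.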

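Step 1. We construct the alternative hypothesis parameter space $\mathcal{H}_1$. Let
$\Sigma^z_0$ denote the covariance matrix of $Z_i$ corresponding to $(\psi^*, I, \sigma) \in \mathcal{H}_0$. Let
$S_1 = \mathrm{supp}(\psi^*) \cup \{1\}$ and $S = S_1 \setminus \{1\}$. Let $k_*$ denote the size of $S$ and $p_1$
denote the size of $S_1^c$, and we have $k_* \le k_1 + q$ and $p_1 \ge p - k_* - 1 \ge cp$.
Without loss of generality, let $S = \{2, \cdots, k_* + 1\}$. We have the following
expression for the covariance matrix of $Z_i$ under the null,


(7.10) $\Sigma^z_0 = \begin{pmatrix} \|\psi^*\|_2^2 + \sigma^2 & \psi^*_1 & (\psi^*_S)^{\top} & 0_{1 \times p_1} \\ \psi^*_1 & 1 & 0_{1 \times k_*} & 0_{1 \times p_1} \\ \psi^*_S & 0_{k_* \times 1} & I_{k_* \times k_*} & 0_{k_* \times p_1} \\ 0_{p_1 \times 1} & 0_{p_1 \times 1} & 0_{p_1 \times k_*} & I_{p_1 \times p_1} \end{pmatrix}.$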

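To construct $\mathcal{H}_1$, we define the following set,

(7.11) $\ell\left(p_1, \frac{\zeta_0}{2} k, \rho\right) = \left\{\delta : \delta \in \mathbb{R}^{p_1},\ \|\delta\|_0 = \frac{\zeta_0}{2} k,\ \delta_i \in \{0, \rho\} \text{ for } 1 \le i \le p_1\right\}.$

Define the parameter space $\mathcal{F}$ for $\Sigma^z$ by $\mathcal{F} = \left\{\Sigma^z_{\delta} : \delta \in \ell\left(p_1, \frac{\zeta_0}{2} k, \rho\right)\right\}$,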

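where


(7.12) $\Sigma^z_{\delta} = \begin{pmatrix} \|\psi^*\|_2^2 + \sigma^2 & \psi^*_1 & (\psi^*_S)^{\top} & \rho_0 \delta^{\top} \\ \psi^*_1 & 1 & 0_{1 \times k_*} & \delta^{\top} \\ \psi^*_S & 0_{k_* \times 1} & I_{k_* \times k_*} & 0_{k_* \times p_1} \\ \rho_0 \delta & \delta & 0_{p_1 \times k_*} & I_{p_1 \times p_1} \end{pmatrix}.$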

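Then we construct the alternative hypothesis space $\mathcal{H}_1$ for $(\psi, \Gamma, \sigma)$, which
is induced by the mapping $h$ and the parameter space $\mathcal{F}$,


(7.13) $\mathcal{H}_1 = \left\{(\psi, \Gamma, \sigma) : (\psi, \Gamma, \sigma) = h(\Sigma^z) \text{ for } \Sigma^z \in \mathcal{F}\right\}.$

In the following, we show that $\mathcal{H}_1 \subset G(k)$. It is necessary to identify
$(\psi, \Gamma, \sigma) = h(\Sigma^z)$ for $\Sigma^z \in \mathcal{F}$ and show $(Q^{\top}\psi, Q^{\top}\Gamma Q, \sigma) \in \Theta(k)$. Firstly,
we identify the expression $E(y_i \mid V_{i\cdot})$ under the alternative joint
distribution (7.12). Assuming $y_i = V_{i\cdot}\psi + \epsilon'_i$, we have

(7.14) $\psi_1 = \frac{\psi^*_1 - \rho_0 \|\delta\|_2^2}{1 - \|\delta\|_2^2}, \quad \psi_S = \psi^*_S, \quad \psi_{-S_1} = (\rho_0 - \psi_1)\, \delta$

and

(7.15) $\mathrm{Var}(\epsilon'_i) = \sigma^2 - \frac{(\rho_0 - \psi^*_1)^2 \|\delta\|_2^2}{1 - \|\delta\|_2^2} \le \sigma^2 \le M_2^2.$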
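The identities (7.14) and (7.15) follow from the normal equations for the block covariance in (7.12). The derivation below is a sketch under our reading of that display, in which $\mathrm{Cov}(y_i, V_{i\cdot}) = (\psi^*_1, \psi^*_S, \rho_0\delta)$ and the covariance of $V_{i\cdot}$ is the matrix $\Psi$ of (7.17); the shorthand $d = \|\delta\|_2^2$ is introduced here only for readability.

```latex
% Normal equations \Psi \psi = \Sigma^z_{vy}, written block by block.
\begin{align*}
\psi_1 + \delta^{\top}\psi_{-S_1} &= \psi_1^*, \qquad
\psi_S = \psi_S^*, \qquad
\delta\,\psi_1 + \psi_{-S_1} = \rho_0\,\delta, \\
% The third block gives \psi_{-S_1}; substitute into the first block.
\psi_{-S_1} = (\rho_0 - \psi_1)\,\delta
\;&\Longrightarrow\;
\psi_1 + (\rho_0 - \psi_1)\,d = \psi_1^*
\;\Longrightarrow\;
\psi_1 = \frac{\psi_1^* - \rho_0\, d}{1 - d}, \\
% Residual variance: \Sigma^z_{yy} minus the explained part.
\operatorname{Var}(\epsilon_i')
&= \Sigma^z_{yy} - (\Sigma^z_{vy})^{\top}\psi
= (\psi_1^*)^2 + \sigma^2 - \psi_1^*\psi_1 - \rho_0(\rho_0 - \psi_1)\,d \\
&= \sigma^2 - \frac{(\rho_0 - \psi_1^*)^2\, d}{1 - d} \;\le\; \sigma^2 .
\end{align*}
```

The last equality uses $\rho_0 - \psi_1 = \frac{\rho_0 - \psi_1^*}{1 - d}$ and $\psi_1^* - \psi_1 = \frac{(\rho_0 - \psi_1^*)\, d}{1 - d}$, both of which follow from the expression for $\psi_1$.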

Based on (7.14), the sparsity of in the alternative hypothesis space is 
upper bounded by 1 -|- |supp (V’J) | + |supp (5) | < ^1 — k, and hence the 
sparsity of the corresponding (3 = Q^'ip is controlled by 

ll/Sllo < (1 - I) 


(7.16) 


k + q < k. 
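Secondly, we show that $\Omega = Q^{\top}\Gamma Q$ satisfies the condition $\frac{1}{M_1} \le \lambda_{\min}(\Omega) \le
\lambda_{\max}(\Omega) \le M_1$. The covariance matrix $\Psi$ of $V_{i\cdot}$ in the alternative hypothesis
parameter space is expressed as


(7.17) $\Psi = \begin{pmatrix} 1 & 0_{1 \times k_*} & 0_{1 \times p_1} \\ 0_{k_* \times 1} & I_{k_* \times k_*} & 0_{k_* \times p_1} \\ 0_{p_1 \times 1} & 0_{p_1 \times k_*} & I_{p_1 \times p_1} \end{pmatrix} + \begin{pmatrix} 0 & 0_{1 \times k_*} & \delta^{\top} \\ 0_{k_* \times 1} & 0_{k_* \times k_*} & 0_{k_* \times p_1} \\ \delta & 0_{p_1 \times k_*} & 0_{p_1 \times p_1} \end{pmatrix}.$


Since the second matrix in the above equation is of spectral norm $\|\delta\|_2$,
Weyl's inequality leads to $\max\{|\lambda_{\min}(\Psi) - 1|, |\lambda_{\max}(\Psi) - 1|\} \le \|\delta\|_2$. When
$\|\delta\|_2$ is chosen such that $\|\delta\|_2 \le \min\left\{1 - \frac{1}{M_1}, M_1 - 1\right\}$, then we have
$\frac{1}{M_1} \le \lambda_{\min}(\Psi) \le \lambda_{\max}(\Psi) \le M_1$. Since $\Gamma = \Psi^{-1}$, and $\Omega$ and $\Gamma = Q\Omega Q^{\top}$ have the same
eigenvalues, we have $\frac{1}{M_1} \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) \le M_1$. Combined with
(7.15) and (7.16), we show that $\mathcal{H}_1 \subset G(k)$.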
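The spectral claim behind the Weyl step can be verified directly: the perturbed covariance in (7.17) has extreme eigenvalues $1 \pm \|\delta\|_2$, with eigenvectors supported on the first coordinate and on the last $p_1$ coordinates. The sketch below checks this with plain matrix-vector products; the block sizes and the vector `delta` are hypothetical, and the helper names are ours.

```python
import math


def psi_matrix(k_star, delta):
    """Covariance of V under the alternative, as in (7.17): the identity plus a
    symmetric coupling between coordinate 1 and the last len(delta) coordinates."""
    p1 = len(delta)
    dim = 1 + k_star + p1
    mat = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for i, d in enumerate(delta):
        mat[0][1 + k_star + i] = d
        mat[1 + k_star + i][0] = d
    return mat


def matvec(mat, vec):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


# Hypothetical perturbation with entries in {0, rho}, rho = 0.1.
delta = [0.1, 0.0, 0.1, 0.0, 0.1]
k_star = 2
norm = math.sqrt(sum(d * d for d in delta))
psi = psi_matrix(k_star, delta)

# Candidate eigenvectors (1, 0, +-delta / ||delta||_2) with eigenvalues 1 +- ||delta||_2.
v_plus = [1.0] + [0.0] * k_star + [d / norm for d in delta]
v_minus = [1.0] + [0.0] * k_star + [-d / norm for d in delta]
out_plus = matvec(psi, v_plus)
out_minus = matvec(psi, v_minus)
print(1.0 + norm, 1.0 - norm)
```

All remaining eigenvalues equal 1, so the spectrum of $\Psi$ lies in $[1 - \|\delta\|_2, 1 + \|\delta\|_2]$, which is exactly the bound that Weyl's inequality delivers.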


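Step 2. To control $\mathrm{TV}\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right)$, it is sufficient to control $\chi^2\left(f_{\pi_{\mathcal{H}_1}}, f_{\pi_{\mathcal{H}_0}}\right)$
and apply (7.2). Let $\pi$ denote the uniform prior on $\delta$ over $\ell\left(p_1, \frac{\zeta_0}{2} k, \rho\right)$.
Note that this uniform prior $\pi$ induces a prior distribution $\pi_{\mathcal{H}_1}$ over the
parameter space $\mathcal{H}_1$. Let $E_{\delta, \tilde{\delta}}$ denote the expectation with respect to the
independent random variables $\delta, \tilde{\delta}$ with uniform prior $\pi$ over the parameter
space $\ell\left(p_1, \frac{\zeta_0}{2} k, \rho\right)$. The following lemma controls the $\chi^2$ distance between
the null and the mixture over the alternative distribution.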

Lemma 2. Let fi = — poV’i^ • Then 

(7.18) (^fnn^,Uno) +'^=^s,s(^^~ ^^PoiPO-^l) +• 


The following lemma is useful in controlling the right hand side of (7.18). 


Lemma 3. 



Let J be a Hypergeometric (p, k, k) variable with F {J = j) = 


(7.19) 


s,2 / k k 

E exp (tJ) < e7-^ (1-1— exp (t) 

\ P P 


Taking po = V^i + o', we have ^ (po (po “ V’l) + /i) = 2 and by Lemma 2, 
By the inequality < exp(2x) for x G 


n 

u, 2 


if 


log 2 


then (1 — 26^S ) < exp ( AnS^S ). By Lemma 3, we further have 


jexp = Eexp {AJnp^^ < e-ipi-aco^ exp (4rep^) 


Co i. 
2 '' 


Cpfc^ 

< e"'ri“2Co'! 


Soi 


1 _ ^ + ^0^ / 4pi 


2pi 2pi V 


< e'‘ri“ 2 =Cop^ j 1 -|- 


1 




where the second inequality follows by plugging in p = 


log 


ipi 

£o^ 


Sn 


and the 


last inequality follows hy k < cp'^. If k < c | where 0 < 7 < ^ and c 


is a sufficient small positive constant, then kp^ < min -j (1 “ ) > 1 

and hence 

(7.20) 


< I 2 « 


and TV 


(/p'-Hl > /^'Hq 


^ 2 -“' 


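Step 3. We calculate the distance between $\mu_1$ and $\mu_0$. Under $\mathcal{H}_0$, $\mu_0 = \psi^*_1$.
Under $\mathcal{H}_1$, $\mu_1 = \psi_1 = \frac{\psi^*_1 - \rho_0 \|\delta\|_2^2}{1 - \|\delta\|_2^2}$. For $\delta \in \ell\left(p_1, \frac{\zeta_0}{2} k, \rho\right)$, $\|\delta\|_2^2 = \frac{\zeta_0}{2} k \rho^2$ and
$\mu_1 = \psi_1 = \frac{\psi^*_1 - \rho_0 \frac{\zeta_0}{2} k \rho^2}{1 - \frac{\zeta_0}{2} k \rho^2}$. Since $\rho$ is selected as fixed, $\mu_1 = \psi_1$ is a fixed
constant for $(\psi, \Gamma, \sigma) \in \mathcal{H}_1$. Note that $\mu_1 - \mu_0 = \frac{(\psi^*_1 - \rho_0) \|\delta\|_2^2}{1 - \|\delta\|_2^2}$. Since $\rho_0 = \psi^*_1 + \sigma$,
it follows that $|\mu_1 - \mu_0| = \sigma\, \frac{\|\delta\|_2^2}{1 - \|\delta\|_2^2} \ge c k\, \frac{\log p}{n}\, \sigma$. Combined with (7.2) and
(7.20), Lemma 1 leads to (7.9). By (7.8), we establish (3.13).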


4pi 


7.3. Proof of upper bound in Theorem 1. The following proposition
establishes the coverage property and the expected length of the
confidence interval constructed in (3.11). Such a confidence interval achieves
the minimax length in (3.1).

Proposition 1. Suppose that k < where c* is a small positive 

eonstant, then 

(7.21) liminf inf Eg (^t^ g (^^Jp^z)) > 1 — a, 

n,p^oo 6»e0{fc) 




















and 

(7.22) L (Clf Z), 0 (k)) < Clieib , 

for some constant C > 0. 


In the following, we prove Proposition 1. By normalizing the
columns of $X$ and the true sparse vector $\beta$, the linear regression model can
be expressed as

(7.23) $y = W d + \epsilon$, with $W = XD$, $d = D^{-1}\beta$ and $\epsilon \sim N(0, \sigma^2 I)$,

where

(7.24) $D = \mathrm{diag}\left(\frac{\sqrt{n}}{\|X_{\cdot j}\|_2}\right)_{j \in [p]}$

denotes the $p \times p$ diagonal matrix with $(j, j)$ entry equal to $\frac{\sqrt{n}}{\|X_{\cdot j}\|_2}$. Take $\delta_0 =$

1.0048 and 7?o = 0.01, and we have Aq = (1 + 7?o) ■ Take eg = 

+ 1 = 202, no = 0.01, Cl = 2.25, cq = g and Cq = 3. Rather than use 
the constants directly in the following discussion, we use 6o,r]o,eo,no,Ci, Cg 
and cg to represent the above fixed constants in the following discussion. We 
also assume that < T and Jologp > 2. Define the li cone invertibility 
factor (CIFi) as follows, 

(7.25) 

„ „ / „ „ , J 

CIFi {ag,K,W) = mf I -- : \\uk4i < ao i,u / 0 > , 

IpaIIi 


where K is an index set. Define (T°™ = -^\\y — X(3\\2 = -^Wv — lT(i|| 2 , 
(7.26) 


T = {k : 141 > Ao4™}, r = (1 + eo) Aq max 




iij II 8Ao|r| _ 

^"ll^’(7/Fi(2eo + l,r,W) 


To facilitate the proof, we define the following events for the random design 
X and the error e. 


Gi= - 


2 1 


5 4Mr 


< 


IX 


j\\2 


n 


< - a/A 4 for 1 < j < p 
5 
G 3 = < max 




G 4 = < k{X, k, a) > 


- 1 


u'^TiU 




- 1 


< 2 

9 


logP ^ plogpl 


n 


, where u = 


G5 = 


= 


iVF'^el 

n 

\W^e 


< a 


4v/A max (O) 

min m 

260 logp \ 


(1 -|- cr) \ k 


log pi 

■ T 


100 ^ ^ora 




n 

eo - 1 


{ ? 

(1 -t) ^ 


n eo + 1 

-S '2 = {(1 - z^o) T < cr < (1 + i^o)t} , 

fii = _ ^T||^ < ^ where A„ = 4GoMf ||^|| 2 Y^ 


logp 


n 


Define G = and S = Cif^iSi. The following lemmas control the 

probability of events G, S and Bi. The detailed proofs of Lemma 4, 5 and 
6 are in the supplement. 

Lemma 4. 

(7.27) Pe (G) > 1 - - - ^ p^~^° - d exp (-cn), 

p 2 Vvr()ologp 


and 

(7.28) 


{Bi) > 1 - 2 p^-"°^o, 


where c and d are universal positive constants. If k < then 

(7.29) 

Po + 1 ~ \/2fl'o + 1\ \ //I 


(Gn5) >P 0 (G)- 2 exp - 


n — c 


p 


1—<5o 


2 J J Vlogp 

where c* and d' are universal positive constants and go = 2 + 3 ^ 0 ' 

The following lemma establishes a data-dependent upper bound for the 
term \W - piU. 


Lemma 5. On the event G D S, 

\0-P\\i < (2 + 2eo) 


n 


min I lx. 


j \\2 


-l{Z,k), 


(7.30) 
where 
(7.31) 

I {Z, k) = max < 


A:Ao(T°™, 


(2 + 2eo) max ||Xj||| + Aqo-^ k 

nK2('x,fc.(l + 2«)(2Sg# 


The following lemma controls the radius of the confidence interval.

Lemma 6. On the event $G \cap S \cap B_1$, there exists $p_0$ such that if $p \ge p_0$,
(7.32) « (k) < ciifib (^ + Lbir). < Iifib logp (T + thir),, 

/n n J V v ^ ^ / 


and 

(7.33) 


logp_ / 1 _ A jlogp^ 


P2 (k) < Ck\l < logp ( ky 


In the following, we establish the coverage property of the proposed con¬ 
fidence interval. By the definition of p in (3.6), we have 


(7.34) 


p - = -iBX^e + 

n \ 


We now construct a confidence interval for the variance term by 

normal distribution and a high probability upper bound for the bias term 
(yd — (5^. Since e is independent of X and u and S is a function 

of X, we have \ X ^ N (o, , and 


■.|x 


f-f — 




az. 


n 


n 


'a/2: 


-az, 


n 


a 12 


X =1-0. 


By (7.34), we have G CIq {Z, k)\X) = 1 — a, where 


Clo{Z,k) = 


p - - uTg 


'u'PZii 


-az, 


n 


'a/2: 


p - (yj - ( y -( d '^+ \/ 


n 
Integrating with respect to X, we have 

(7.35) Pe G CIo (Z, k)) = [ P,|, G CIo {Z, k)\x) f{x)dx = 1 - a. 


Since 




T - STS 


< II^T _{tTS| 


— on the event S (1 G, 


Lemma 5 and the constraint in (3.5) lead to 


(7.36) ||^T_ys|U||^-/3||i < A^(2 + 2eo) . n l{Z,k), 

mm ||A.j||2 

where I {Z,k) is defined in (7.31). On the event G H 5, we also have a < 

(1 + i^o) d and a°^°' < (1 + t'o) 1 + + 2^^(T. We define the follow¬ 

ing confidence interval to facilitate the discussion, CIi {Z, k) = {J1 — lk,]l + Ik), 

where 4 = (1 + z^o) \/^^^a/ 2 ^ + Ci {X, k) On the event GnS", 

we have 


(7.37) CIo {Z, k) C CIi {Z, k). 

On the event S 2 , if p > exp( 2 M 2 ), then a < j^a < i^M 2 < logp. 
Hence, the event A holds and Clf (^t^^ = [Jl — pi{k), 'jl + pi{k)]. By 

Lemma 6 , on the event G O 5 O Bi, A p > max{po, exp ( 2 M 2 )}, we have 
Pi {k) = 4) and hence 

(7.38) CIi(Z,A:) = Clf(eT/3,Z). 

We have the following bound on the coverage probability, 

P, ({^T^ e Cif z)}) > Pe ({^T^ e CIo (z. A:)} n 5 n Gn Bi) 

>P0 ({^T^ g CIo {Z, k)}) - Pe ((5 n G n Bi)'^) = 1 - a - P, ((5 n G n Bi)'^) 
=P 0 (5nGnHi) - a, 

where the first inequality follows from (7.37) and (7.38) and the first equality 
follows from (7.35). Combined with Lemma 4, we establish (7.21). We control 
the expected length as follows, 

(7.39) 

E,L (Clf Z)) = EeL (Clf (^t^^ z)) U 
=E0 L (Clf (^T^^ z)) lAn(SnGnBi) + (Clf (^t^^ z)) lAn(SnGnSi)= 

<G||e ||2 (k^^ + ^)^ + 11^112 (logp)' ((5 n G n Hi)'^) 

<G||e ||2 [k^^ + ^^(^a + C (^/-min{5o.Ci,coGg} ^ (logp) 2 ) , 
where the first inequality follows from (7.32) and the second inequality follows 
from Lemma 4. If < c, then ^ (logp)^ —)• 

0, and hence E^L (Clf (e/3, Z)) < CH^Ib (A:^ + 
SUPPLEMENTARY MATERIAL 

Supplement to “Confidence Intervals for High-Dimensional Linear
Regression: Minimax Rates and Adaptivity”.

(http://www-stat.wharton.upenn.edu/~tcai/paper/CI-Reg-Supplement.pdf).
Detailed proofs of the adaptivity lower bound and minimax upper bound for
confidence intervals of the linear functional $\xi^{\top}\beta$ with a dense loading $\xi$ are
given. The minimax rates and adaptivity of confidence intervals of the
linear functional $\xi^{\top}\beta$ are established when there is prior knowledge that $\Omega = I$
and $\sigma = \sigma_0$. Extra propositions and technical lemmas are also proved in the
supplement.